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Abstract:  Short  summary  of  most  important  research  results  that  explain  why  the  work  was  done, 
what  was  accomplished,  and  how  it  pushed  scientific  frontiers  or  advanced  the  field.  This  summary 
will  be  used  for  archival  purposes  and  will  be  added  to  a  searchable  DoD  database. 

First,  we  addressed  the  problem  of  detecting  the  period  in  which  information  diffusion  burst 
occurs  from  a  single  observed  diffusion  sequence  under  the  assumption  that  the  delay  of  the 
information  propagation  over  a  social  network  follows  the  exponential  distribution.  To  be  more 
precise,  we  formulated  the  problem  of  detecting  the  change  points  and  finding  the  values  of  the  time 
delay  parameter  in  the  exponential  distribution  as  an  optimization  problem  of  maximizing  the 
likelihood  of  generating  the  observed  diffusion  sequence.  We  devised  an  efficient  iterative  search 
algorithm  for  the  change  point  detection  whose  time  complexity  is  almost  linear  to  the  number  of  data 
points.  We  tested  the  algorithm  against  the  real  Twitter  data  of  the  2011  Tohoku  earthquake  and 
tsunami,  and  experimentally  confirmed  that  the  algorithm  is  much  more  efficient  than  the  exhaustive 
naive  search  and  is  much  more  accurate  than  the  simple  greedy  search. 

Second,  we  addressed  the  problem  of  how  people  make  their  own  decisions  based  on  their 
neighbors’  opinions.  The  model  best  suited  to  discuss  this  problem  is  the  voter  model  and  several 
variants  of  this  model  have  been  proposed  and  used  extensively.  However,  all  of  these  models  assume 
that  people  use  their  neighbors’  latest  opinions.  Thus,  we  enhanced  the  original  voter  model  and 
defined  the  temporal  decay  voter  (TDV)  model  incorporating  a  temporary  decay  function  with 
parameters,  and  proposed  an  efficient  method  of  learning  these  parameters  from  the  observed  opinion 
diffusion  data.  We  further  proposed  an  efficient  method  of  selecting  the  most  appropriate  decay 
function  from  among  the  candidate  functions  each  with  the  optimized  parameter  values.  We  adopted 
three  functions  as  the  typical  candidates:  the  exponential  decay,  the  power-law  decay,  and  no  decay, 
and  evaluate  the  proposed  method  (parameter  learning  and  model  selection)  through  extensive 
experiments.  We,  first,  experimentally  demonstrated,  by  using  synthetic  data,  the  effectiveness  of  the 
proposed  method,  and  then  we  analyzed  the  real  opinion  diffusion  data  from  a  Japanese 
word-of-mouth  communication  site  for  cosmetics  using  three  decay  functions  above,  and  showed  that 
most  opinions  conform  to  the  TDV  model  of  the  power-law  decay  function. 

Introduction:  Include  a  summary  of  specific  aims  of  the  research  and  describe  the  importance  and 
ultimate  goal  of  the  work. 

We  focus  on  on-line  societies  including  sites  such  as  for  micro-blogging,  social  networking, 
knowledge-sharing  and  media-sharing  in  the  World  Wide  Web,  through  which  behaviors,  ideas  and 
opinions  can  spread  over  time.  Clearly,  the  information  diffusion  and  its  contents  evolution  processes 
in  these  on-line  societies  also  reflects  complex  social  structures  and  distributed  social  interests.  Thus, 
it  is  worth  putting  some  effort  to  attempt  to  find  empirical  regularities  and  develop  explanatory 
accounts  of  human  communication  in  these  sites.  Such  attempts  would  be  valuable  for  understanding 
social  structures  and  trends,  and  inspire  us  to  discover  new  knowledge  and  provide  insights  into 
underlying  human  communication.  Our  ultimate  goal  of  this  project  is  to  develop  learnable  models  for 
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rumor  spreading  and  its  associated  user  behavior  in  a  micro-blogosphere.  We  believe  that  our  research 
outcome  helps  understanding  fundamental  mechanisms  of  information  diffusion  and  evolution 
processes  in  our  society.  Moreover,  it  is  highly  expected  that  this  kind  of  mathematical  studies  using 
large-scale  networks  such  as  a  micro-blog  communication  network  can  bridge  a  gap  between 
empirical  social  networks  analyses  and  fundamental  mathematics. 

Experiment:  Description  of  the  experiment(s)/theory  and  equipment  or  analyses. 

The  information  diffusion  data  we  used  for  evaluation  were  extracted  from  201,297,161 
tweets  of  1,088,040  Twitter  users  who  tweeted  at  least  200  times  during  the  three  weeks  from  March  5 
to  24,  2011  that  includes  March  1 1,  the  day  of  201 1  Tohoku  earthquake  and  tsunami.  It  is  conceivable 
to  use  a  retweet  sequence  in  which  a  user  sends  out  other  user’s  tweet  without  any  modification.  But 
there  exists  multiple  styles  of  retweeting  (official  retweet  and  unofficial  retweet),  and  it  is  very 
difficult  to  accurately  extract  a  sequence  of  tweets  in  an  automatic  manner  considering  all  of  these 
different  styles.  Therefore,  in  our  experiments,  noting  that  each  retweet  includes  the  ID  of  the  user 
who  sent  out  the  original  tweet  in  the  form  of  “@ID”,  we  extracted  tweets  that  include  @ID  format  of 
each  user  ID  and  constructed  a  sequence  data  for  each  user.  More  precisely,  we  used  information 
diffusion  sequences  of  798  users  for  which  the  length  of  sequences  are  more  than  5,000  (number  of 
tweets).  Note  that  each  diffusion  sequence  includes  retweet  sequences  on  multiple  topics.  Since  we  do 
not  know  the  ground  tmth  of  the  change  points  for  each  sequence  if  there  are  changes  in  it,  we  used 
the  naive  method  which  exhaustively  searches  for  all  the  possible  combinations  of  the  change  points 
as  giving  the  ground  truth.  We  had  to  limit  the  number  of  change  points  to  2  (/=  2)  in  order  for  the 
naive  method  to  return  the  solution  in  a  reasonable  amount  of  computation  time. 

The  opinion  formation  data  we  used  for  evaluation  were  collected  from  “@cosme”,  which  is 
a  Japanese  word-of-mouth  communication  website  for  cosmetics.  In  @cosme,  a  user  can  post  a 
review  and  give  a  score  of  each  brand  (one  from  1  to  7).  When  one  user  registers  another  user  as 
his/her  favorite  user,  a  “fan-link”  is  created  between  them.  We  traced  up  to  ten  steps  in  the  fan-links 
from  a  randomly  chosen  user  in  December  2009,  and  collected  a  set  of  ( b ,  k,  t,  v)’s,  where  ( b ,  k,  t,  v) 
means  that  user  v  scored  brand  b  k  points  at  time  1.  The  number  of  brands  was  7,139,  the  number  of 
users  was  45,024,  and  the  number  of  reviews  posted  was  331,084.  For  each  brand  b,  we  regarded  the 
point  k  scored  by  a  user  v  as  the  opinion  k  of  v,  and  constructed  the  opinion  diffusion  sequence  Dm  (b) 
consisting  of  3-tuple  ( k ,  t,  v).  In  particular,  we  focused  on  these  brands  in  which  the  number  of 
samples  N  =  \ Dn  (b) |  was  greater  than  500.  Then,  the  number  of  brands  was  120.  We  refer  to  this 
dataset  as  the  @cosme  dataset. 

Results  and  Discussion:  Describe  significant  experimental  and/or  theoretical  research  advances  or 
findings  and  their  significance  to  the  field  and  what  work  may  be  performed  in  the  future  as  a  follow 
on  project.  Fellow  researchers  will  be  interested  to  know  what  impact  this  research  has  on  your 
particular  field  of  science. 

Information  diffusion'.  By  analyzing  the  real  information  diffusion  data,  we  revealed  that 
even  if  the  data  contains  tweets  talking  about  plural  topics,  the  detected  burst  period  tends  to  contain 
tweets  on  a  specific  topic  intensively.  In  addition,  we  experimentally  confirmed  that  assuming  the 
information  diffusion  path  to  be  the  line  shape  tree  results  in  much  better  approximation  of  the 
maximum  likelihood  estimator  than  assuming  it  to  be  the  star  shape  tree.  This  is  a  good  heuristic  to 
accurately  estimate  the  change  points  when  the  actual  diffusion  path  is  not  known  to  us.  These  results 
indicate  that  it  is  possible  to  detect  and  identify  both  the  burst  period  and  the  topic  diffused  without 
extracting  the  tweet  sequence  for  each  topic  and  identifying  the  diffusion  paths  for  each  sequence,  and 
the  proposed  method  can  be  a  useful  tool  to  analyze  a  huge  amount  of  information  diffusion  data.  Our 
immediate  future  work  is  to  compare  the  proposed  method  with  existing  burst  detection  methods  that 
are  designed  for  data  stream.  We  also  plan  to  devise  a  method  of  finding  nodes  that  caused  the  bust 
based  on  the  change  points  detected. 

Opinion  formation'.  We  first  tested  the  proposed  algorithms  by  synthetic  datasets  assuming 
that  there  are  two  decay  models:  the  exponential  decay  and  the  power-law  decay.  We  confirmed  that 
the  learning  algorithm  correctly  identifies  the  parameter  values  and  the  model  selection  algorithm 
correctly  identifies  which  model  the  data  came  from.  We  then  applied  the  method  to  the  real  opinion 


diffusion  data  taken  from  a  Japanese  word-of-mouth  communication  site  for  cosmetics,  i.e.,  the 
@cosme  dataset.  We  used  the  two  decay  functions  above  and  added  no  decay  function  as  a  baseline. 
The  result  of  the  analysis  revealed  that  opinions  of  most  of  the  brands  conform  to  the  TDV  model  of 
the  power-law  decay  function.  We  found  this  interesting  because  this  is  consistent  with  the 
observation  that  many  human  actions  are  related  to  the  power-law.  Some  brands  showed  behaviors 
characteristic  to  the  brands,  e.g.,  the  older  brand  that  releases  new  product  less  frequently  naturally 
follows  no  decay  TDV  and  the  newer  brand  that  releases  new  product  more  frequently  naturally 
follows  the  power-law  decay  TDV  with  large  decay  constant,  which  are  all  well  interpretable. 

List  of  Publications  and  Significant  Collaborations  that  resulted  from  your  AOARD  supported 
project:  In  standard  format  showing  authors,  title,  journal,  issue,  pages,  and  date,  for  each  category 
list  the  following: 

a)  papers  published  in  peer-reviewed  journals, 

1.  Masahiro  Kimura,  Kazumi  Saito,  Kouzou  Ohara  and  Hiroshi  Motoda,  "Learning  to  predict 
opinion  share  and  detect  anti-majority  opinionists  in  social  networks"  to  appear  in  Journal  of 
Intelligent  Information  Systems  (JUS). 

2.  Kazumi  Saito,  Masahiro  Kimura,  Kouzou  Ohara,  and  Hiroshi  Motoda,  "Detecting  Changes  in 
Information  Diffusion  Pattern  over  Social  Network,"  to  appear  in  ACM  Transactions  on 
Intelligent  Systems  and  Technology  (TIST). 

b)  papers  published  in  peer-reviewed  conference  proceedings, 

1.  Shoko  Kato,  Akihiro  Koide,  Takayasu  Fushimi,  Kazumi  Saito  and  Hiroshi  Motoda,  "Network 
Analysis  of  Three  Twitter  Functions:  Favorite,  Follow  and  Mention,"  to  appear  in  Proc.  of  the 
2012  Pacific  Rim  Knowledge  Acquisition  Workshop  (PKAW2012). 

2.  Takayasu  Fushimi,  Kazumi  Saito  and  Kazuhiro  Kazama  "Extracting  Communities  in  Networks 
based  on  Functional  Properties  of  Nodes,"  to  appear  in  Proc.  of  the  2012  Pacific  Rim  Knowledge 
Acquisition  Workshop  (PKAW2012). 

3.  Masahiro  Kimura,  Kazumi  Saito,  Kouzou  Ohara  and  Hiroshi  Motoda,  "Opinion  Formation  by 
Voter  Model  with  Temporal  Decay  Dynamics,"  Proc.  of  the  European  Conference  on  Machine 
Learning  and  Principles  and  Practice  of  Knowledge  Discovery  in  Databases 
(ECML-PKDD2012),  pp.  565-580,  2012. 

4.  Kazumi  Saito,  Kouzou  Ohara,  Masahiro  Kimura  and  Hiroshi  Motoda,  "Burst  Detection  in  a 
Sequence  of  Tweets  based  on  Information  Diffusion  Model,"  Proc.  of  the  Fifteenth  International 
Conference  on  Discovery  Science  (DS2012),  pp.  239—253,  2012. 

5.  Kazumi  Saito,  Masahiro  Kimura,  Kouzou  Ohara  and  Hiroshi  Motoda,  "Graph  Embedding  on 
Spheres  and  its  Application  to  Visualization  of  Information  Diffusion  Data,"  Proc.  of  the 
International  Workshop  on  Mining  Social  Network  Dynamics  (MSND2012),  pp.  1137—1144, 
2012. 

6.  Kouzou  Ohara,  Kazumi  Saito,  Masahiro  Kimura,  and  Hiroshi  Motoda,  "Effect  of  In/Out-Degree 
Correlation  on  Influence  Degree  of  Two  Contrasting  Information  Diffusion  Models,"  Proc.  of  the 
2012  International  Conference  on  Social  Computing,  Behavioral  Modeling,  and  Prediction 
(SBP2012),  pp.  131-138,  2012. 

7.  Takayasu  Fushimi,  Yamato,  Kubota,  Kazumi  Saito,  Masahiro  Kimura,  Hiroshi  Motoda,  and 
Kouzou  Ohara,  "Speeding  up  Bipartite  Graph  Visualization  Method,"  Proc.  of  the  24th 
Australasian  Joint  Conference  on  Artificial  Intelligence  (AI2011),  pp. 697— 706,  2011. 

c)  papers  published  in  non-peer-reviewed  journals  and  conference  proceedings, 

d)  conference  presentations  without  papers, 

e)  manuscripts  submitted  but  not  yet  published,  and 

f)  provide  a  list  any  interactions  with  industry  or  with  Air  Force  Research  Laboratory  scientists  or 
significant  collaborations  that  resulted  from  this  work. 

None. 


Attachments:  Publications  a),  b)  and  c)  listed  above  if  possible. 


DD882:  As  a  separate  document,  please  complete  and  sign  the  inventions  disclosure  form. 


Important  Note:  If  the  work  has  been  adequately  described  in  refereed  publications,  submit  an 
abstract  as  described  above  and  cite  important  findings  to  your  above  List  of  Publications.  If  a  full 
report  needs  to  be  written,  then  submission  of  a  final  report  that  is  very  similar  to  a  frill  length  journal 
article  will  be  sufficient  in  most  cases.  This  document  may  be  as  long  or  as  short  as  needed  to  give  a 
fair  account  of  the  work  performed  during  the  period  of  performance.  There  will  be  variations 
depending  on  the  scope  of  the  work.  As  such,  there  is  no  length  or  formatting  constraints  for  the  final 
report.  Keep  in  mind  the  amount  of  funding  you  received  relative  to  the  amount  of  effort  you  put 
into  the  report.  For  example,  do  not  submit  a  $300k  report  for  $50k  worth  of  funding;  likewise,  do 
not  submit  a  $50k  report  for  $300k  worth  of  funding.  Include  as  many  charts  and  figures  as  required 
to  explain  the  work. 


J  Intell  Inf  Syst  manuscript  No. 

(will  be  inserted  by  the  editor) 


Learning  to  predict  opinion  share  and  detect  anti-majority 
opinionists  in  social  networks 


Masahiro  Kimura  Kazumi  Saito 
•  Kouzou  Ohara  •  Hiroshi  Motoda 


Received:  date  /  Accepted:  date 


Abstract  We  address  the  problem  of  detecting  anti-majority  opinionists  using  the  value- 
weighted  mixture  voter  (VwMV)  model.  This  problem  is  motivated  by  the  fact  that  1)  each 
opinion  has  its  own  value  and  an  opinion  with  a  higher  value  propagates  more  easily/rapidly 
and  2)  there  are  always  people  who  have  a  tendency  to  disagree  with  any  opinion  expressed 
by  the  majority.  We  extend  the  basic  voter  model  to  include  these  two  factors  with  the  value 
of  each  opinion  and  the  anti-majoritarian  tendency  of  each  node  as  new  parameters,  and 
learn  these  parameters  from  a  sequence  of  observed  opinion  data  over  a  social  network. 
We  experimentally  show  that  it  is  possible  to  learn  the  opinion  values  correctly  using  a 
short  observed  opinion  propagation  data  and  to  predict  the  opinion  share  in  the  near  future 
correctly  even  in  the  presence  of  anti-majoritarians,  and  also  show  that  it  is  possible  to 
learn  the  anti-majoritarian  tendency  of  each  node  if  longer  observation  data  is  available. 
Indeed,  the  learned  model  can  predict  the  future  opinion  share  much  more  accurately  than 
a  simple  polynomial  extrapolation  can  do.  Ignoring  these  two  factors  substantially  degrade 
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the  performance  of  share  prediction.  We  also  show  theoretically  that,  in  a  situation  where 
the  local  opinion  share  can  be  approximated  by  the  average  opinion  share,  1)  when  there 
are  no  anti-majoritarians,  the  opinion  with  the  highest  value  eventually  takes  over,  but  2) 
when  there  are  a  certain  fraction  of  anti-majoritarians,  it  is  not  necessarily  the  case  that  the 
opinion  with  the  highest  value  prevails  and  wins,  and  further,  3)  in  both  cases,  when  the 
opinion  values  are  uniform,  the  opinion  share  prediction  problem  becomes  ill-defined  and 
any  opinion  can  win.  The  simulation  results  support  that  this  holds  for  typical  real  world 
social  networks.  These  theoretical  results  help  understand  the  long  term  behavior  of  opinion 
propagation. 

Keywords  Social  networks  •  Opinion  dynamics  •  Parameter  learning 


1  Introduction 

The  emergence  of  large  scale  social  computing  applications  has  made  massive  social  net¬ 
work  data  available,  and  large  networks  formed  by  these  services  play  an  important  role 
as  a  medium  for  spreading  diverse  information  including  news,  ideas,  opinions,  and  ru¬ 
mors  (Newman  et  al,  2002;  Newman,  2003;  Gruhl  et  al.  2004;  Domingos,  2005).  Thus,  in¬ 
vestigating  the  spread  of  influence  in  social  networks  has  been  the  focus  of  attention  (Leskovec 
et  al,  2007a;  Crandall  et  al,  2008;  Wu  and  Huberman,  2008;  Romero  et  al,  2011;  Bakshy 
et  al,  2011;  Mathioudakis  et  al,  201 1). 

The  most  well  studied  problem  would  be  the  influence  maximization  problem ,  that  is,  the 
problem  of  finding  a  limited  number  of  influential  nodes  that  are  effective  for  spreading  in¬ 
formation  through  the  network.  Many  new  algorithms  that  can  effectively  find  approximate 
solutions  have  been  proposed  both  for  estimating  the  expected  influence  and  for  finding 
good  candidate  nodes  under  different  model  assumptions,  e.g.,  descriptive  probabilistic  in¬ 
teraction  models  (Domingos  and  Richardson,  2001;  Richardson  and  Domingos,  2002),  and 
basic  diffusion  models  such  as  the  independent  cascade  (IC)  model  and  the  linear  threshold 
(LT)  model  (Kempe  et  al,  2003;  Kimura  et  al.  2010a;  Leskovec  et  al,  2007b;  Chen  et  al, 
2009,  2010).  This  problem  has  good  applications  in  sociology  and  “viral  marketing”  (Agar- 
wal  and  Liu,  2008).  However,  the  models  used  above  allow  a  node  in  the  network  to  take 
only  one  of  the  two  states,  i.e.,  either  active  or  inactive,  because  the  focus  is  on  influence. 

Applications  such  as  an  on-line  competitive  service  in  which  a  user  can  choose  one  from 
multiple  choices  and  decisions,  however,  require  a  different  approach  where  a  model  must 
handle  multiple  states.  The  model  best  suited  for  this  kind  of  analysis  would  be  a  voter 
model,  which  is  the  model  to  analyze  how  different  opinions  spread  over  a  social  network. 

It  is  one  of  the  most  basic  stochastic  process  model,  and  has  the  same  key  property  with  the 
linear  threshold  (LT)  model  in  that  a  node  decision  is  influenced  by  its  neighbor’s  decision, 
i.e.,  a  person  changes  its  opinion  by  the  opinions  of  its  neighbors.  In  the  basic  voter  model 
which  is  defined  on  an  undirected  network,  each  node  initially  holds  one  of  the  two  opinions, 
e.g.,  yes  or  no,  and  adopts  the  opinion  of  a  randomly  chosen  neighbor  at  each  subsequent 
discrete  time-step. 

In  this  paper,  we  address  the  problem  of  opinion  formation  by  using  an  extended  voter 
model  for  which  multiple  states  are  needed.  There  are  three  extensions.  As  described  above, 
the  original  voter  model  can  handle  only  two  opinions  and  assumes  discrete  time-step.  We 
extended  the  basic  voter  model  to  be  able  to  handle  K  opinions  and  to  allow  asynchronous 
opinion  update.  This  is  just  to  make  the  basic  voter  model  to  be  more  realistic  and  this  ex¬ 
tension  is  straightforward.  Indeed,  the  actual  opinion  update  is  asynchronous  and  if  we  are 
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to  use  the  observed  data,  synchronous  discrete  time-step  model  does  not  work.  The  other 
two  extensions  are  more  fundamental.  We  note  that  when  we  have  to  make  a  decision  from 
multiple  choices,  we  consider  the  value  of  each  choice,  e.g.,  quality,  brand,  authority,  etc. 
because  this  definitely  affects  our  choice.  The  same  is  true  for  opinion  formation.  We  listen 
to  and  evaluate  what  our  neighbors  say  and  change  our  opinions.  Thus,  the  second  exten¬ 
sion  is  to  incorporate  the  value  of  each  opinion  as  a  new  parameter.  The  extended  model 
is  referred  to  as  the  value-weighted  voter  (VwV)  model  with  multiple  opinions.  Same  as  the 
basic  voter  model,  the  VwV  model  assumes  that  people  naturally  tend  to  follow  their  neigh¬ 
bors’  majority  opinion.  However,  we  note  that  there  are  always  people  who  do  not  agree 
with  the  majority  and  support  the  minority  opinion,  which  was  also  addressed  in  Gill  and 
Gainous  (2002)  and  Arenson  (1996).  We  are  interested  in  how  this  affects  the  opinion  share. 
Thus,  the  third  extension  is  to  include  this  anti-majority  effect  by  linearly  combining  the 
VwV  model  and  the  anti-majority  model  with  the  anti-majoritarian  tendency  of  each  node 
as  a  new  parameter.  The  extended  model  is  referred  to  as  the  value-weighted  mixture  voter 
(VwMV)  model.  We  will  discuss  how  to  learn  these  parameters  from  the  observed  opinion 
propagation  data  and  how  accurately  the  learned  model  can  predict  the  future  opinion  share. 

There  has  been  a  variety  of  work  on  the  voter  model.  Liggett  (1999)  and  Sood  and 
Redner  (2005)  extensively  studied  dynamical  properties  of  the  basic  model,  including  how 
the  degree  distribution  and  the  network  size  affect  the  mean  time  to  reach  consensus,  from 
a  mathematical  point  of  view.  Castellano  et  al  (2009)  and  Yang  et  al  (2009)  investigated 
several  variants  of  the  voter  model  and  analyzed  non-equilibrium  phase  transition  from  a 
physics  point  of  view.  Holme  and  Newman  (2006)  and  Crandall  et  al  (2008)  extended  the 
voter  model  to  combine  it  with  a  network  evolution  model.  These  studies  gave  insights  into 
the  fundamentals  of  the  vote  model,  but  their  focuses  are  different  from  what  this  paper  in¬ 
tends  to  address,  i.e.,  parameter  learning  from  the  data  and  share  prediction  at  a  specific  time 
T  with  opinion  values  and  anti-majoritarian  tendency  considered.  Even-Dar  and  Shapira 
(2007),  whose  work  we  think  has  a  similar  goal  to  ours  in  spirit,  investigated  the  influence 
maximization  problem  (maximizing  the  spread  of  the  opinion  that  supports  a  new  technol¬ 
ogy)  at  a  given  target  time  T  under  the  basic  voter  model,  i.e.,  with  two  opinions  (one  in 
favor  of  the  new  technology  and  the  other  against  it).  They  showed  that  the  most  natural 
heuristic  solution,  which  picks  the  nodes  in  the  network  with  the  highest  degree,  is  indeed 
the  optimal  solution,  under  the  condition  that  all  nodes  have  the  same  cost.  This  work  is 
close  to  ours  in  that  it  measures  the  influence  at  a  specific  time  T,  but  is  different  in  all 
others  (no  share  prediction,  no  value  and  anti-majoritarian  tendency  considered,  no  more 
than  two  opinions,  no  asynchronous  update  and  no  learning).  We  should  mention  that  we 
are  not  the  first  to  introduce  the  notion  of  anti-majority.  There  is  a  model  called  anti-voter 
model  where  only  two  opinions  are  considered  (Huber  and  Reinert,  2004;  Donnelly  and 
Welsh,  1984;  Matloff,  1977).  Each  one  chooses  one  of  its  neighbors  randomly  and  decides 
to  take  the  opposite  opinion  of  the  neighbor  chosen.  Rollin  (2007)  analyzed  the  statistical 
property  of  the  anti-voter  model  introducing  the  notion  of  exchangeable  pair  couplings.  Our 
work  is  different  from  theirs,  apart  from  the  learning  mechanism  and  being  able  to  handle 
multiple  opinions,  in  that  we  consider  the  effect  of  both  the  voter  and  the  anti-voter  models 
by  introducing  the  anti-majoritarian  tendency  of  each  node  as  a  new  parameter. 

This  paper  is  an  extension  and  integration  of  what  we  have  reported  in  Kimura  et  al 
(2010b)  and  Kimura  et  al  (2011).  In  the  former  we  addressed  the  problem  of  predicting  the 
opinion  share  at  a  future  time  (before  an  consensus  is  reached)  by  learning  the  opinion  values 
from  a  limited  amount  of  past  observed  opinion  diffusion  data  using  the  VwV  model.  In  the 
latter  we  introduced  the  VwMV  model  and  mainly  focused  on  the  learning  performance  of 
the  anti-majoritarian  tendency  of  each  node.  In  this  paper  we  extend  our  preliminary  work 
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and  analyze  the  share  prediction  performance  of  the  VwMV  model  when  both  the  opinion 
values  and  the  anti-majoritarian  tendency  are  not  known  and  have  to  be  learned  from  the 
observed  opinion  propagation  data,  investigate  how  the  average  anti-majoritarian  tendency 
affects  the  learning  performance,  and  detect  who  the  anti-majoritarians  are.  In  particular,  we 
seek  for  the  answer  for  the  following  questions:  what  the  opinion  share  will  be  in  the  near 
future,  given  only  the  limited  amount  of  observed  data,  how  easy  it  is  to  learn  both  opinion 
value  and  anti-majority  tendency,  and  how  much  the  observed  data  is  required  to  learn  and 
identify  the  anti-majoritarians  accurately  enough.  It  is  important  to  learn  the  model  quickly 
and  predict  what  will  happen  in  the  near  future  when  a  new  opinion  appears.  The  model  is 
too  simple  to  accurately  predict  the  far  future.  For  this,  it  is  more  desirable  to  understand  the 
asymptotic  behavior  by  a  theoretical  analysis. 

We  conjecture  that  learning  opinion  values  is  easy  because  the  number  of  opinion  K 
is  not  many  (order  of  tens),  but  learning  anti-majoritarian  tendency  is  not  easy  because  the 
tendency  is  associated  with  each  node  and  the  number  of  nodes  is  huge  (order  of  ten  thou¬ 
sands  or  more).  We  further  conjecture  that  predicting  the  opinion  share  is  much  easier  than 
identifying  the  anti-majoritarians  because  the  former  is  a  macroscopic  quantity  over  the 
whole  network  but  the  latter  is  defined  for  each  node.  We  show  that  both  the  parameters, 
anti-majoritarian  tendency  and  opinion  value,  can  be  learned  by  an  iterative  algorithm  that 
maximizes  the  likelihood  of  the  model’s  generating  the  observed  data,  and  confirmed  the 
above  conjectures  by  experiments.  We  tested  the  algorithm  for  four  real  world  social  net¬ 
works  with  size  ranging  over  4,000  to  12,000  nodes  and  40,000  to  250,000  links,  and  showed 
that  the  parameter  value  update  algorithm  correctly  identifies  both  the  values  of  opinions  and 
the  anti-majoritarian  tendency  of  each  node  under  various  situations.  The  opinion  values  can 
be  learned  in  good  accuracy  with  a  small  amount  of  data,  but  the  anti-majoritarian  tendency 
needs  a  sufficiently  large  amount  of  data  to  improve  the  accuracy.  Use  of  the  learned  model 
can  predict  the  opinion  share  in  the  near  future  very  accurately  despite  the  existence  of 
anti-majoritarians.  The  theoretical  analysis  under  the  assumption  in  which  the  local  opin¬ 
ion  share  can  be  approximated  by  the  average  opinion  share  shows  that  1)  when  there  are 
no  anti-majoritarians,  the  opinion  with  the  highest  value  eventually  takes  over,  but  2)  when 
there  is  a  certain  fraction  of  anti-majoritarians,  it  is  not  necessarily  the  case  that  the  opinion 
with  the  highest  value  prevails  and  wins,  and  further,  3)  in  both  cases,  when  the  opinion  val¬ 
ues  are  uniform,  the  opinion  share  prediction  problem  becomes  ill-defined  and  any  opinion 
can  win,  and  these  are  also  supported  by  real  world  networks  in  which  the  above  assump¬ 
tion  does  not  hold.  We  want  to  emphasize  that  it  is  crucially  important  to  explicitly  model 
the  anti-majority  effect  to  obtain  good  results.  Predicting  the  share  by  VwV  model  when 
there  are  anti-majoritarians  does  not  work.  There  seems  to  be  no  simple  way  to  estimate 
the  anti-majoritarian  tendency.  The  heuristic  that  simply  counts  the  number  of  opinion  up¬ 
dates  in  which  the  chosen  opinion  is  the  same  as  the  minority  opinion  gives  only  a  very 
poor  approximation.  These  results  show  that  the  model  learned  by  the  proposed  algorithm 
can  be  used  to  predict  the  future  opinion  share  and  provides  a  way  to  analyze  such  prob¬ 
lems  as  influence  maximization  or  minimization  for  opinion  diffusion  under  the  presence  of 
anti-majoritarians. 

The  paper  is  organized  as  follows.  We  introduce  the  basic  voter  and  anti-voter  models 
in  Section  2  and  our  proposed  models,  VwV  and  VwMV  models  in  Section  3.  We  then 
perform  the  behavior  analysis  for  share  prediction  using  the  mean  field  theory  and  discuss 
the  behavior  qualitatively  in  Section  4,  and  describe  the  parameter  learning  algorithm  in 
Section  5.  We  detail  the  results  of  experimental  evaluations  in  Section  6.  We  summarize 
what  has  been  achieved  and  conclude  the  paper  in  Section  7. 
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2  Voter  Models 

We  consider  the  diffusion  of  opinions  in  a  social  network  represented  by  an  undirected 
(bidirectional)  graph  G  =  ( V,E )  with  self-loops,  where  V  and  E  (c  V  x  V)  are  the  sets  of  all 
the  nodes  and  links  in  the  network,  respectively.  For  a  node  v  e  V,  let  r(v)  denote  the  set  of 
neighbors  of  v  in  G,  i.e., 

E(v)  =  {u  £  V;  (m,v)  e  E). 

Note  that  v  e  r(v).  We  revisit  the  basic  voter  model  that  is  one  of  the  standard  models  of 
opinion  dynamics,  and  the  anti-voter  model  that  is  its  variant,  where  the  number  of  opinions 
is  set  to  two. 


2.1  Basic  Voter  Model 


According  to  the  work  of  Even-Dar  and  Shapira  (2007),  we  recall  the  definition  of  the  ba¬ 
sic  voter  model  on  network  G.  In  the  model,  each  node  of  G  is  endowed  with  two  states; 
opinions  1  and  2.  The  opinions  are  initially  assigned  to  all  the  nodes  in  G,  and  the  evolution 
process  unfolds  in  discrete  time-steps  t  =  1,2,3,  ••  •  as  follows:  At  each  time-step  t,  each 
node  v  picks  a  random  neighbor  u  and  adopts  the  opinion  that  u  holds  at  time-step  t—  1. 

More  formally,  let  f,  :  V  — » {1,2}  denote  the  opinion  distribution  at  time-step  t,  where 
f,(v)  stands  for  the  opinion  of  node  v  at  time-step  t.  Then,  /o  :  V  — >  { 1 , 2}  is  the  initial  opinion 
distribution,  and  f, :  V  — >  { 1 , 2}  is  inductively  defined  as  follows:  For  any  v  e  V ,  node  v  selects 
its  opinion  according  to  the  probability  distribution. 


P(/,(v)=l)  = 


P(f,(v)  =  2)  = 


Ni(t-  l,v) 
IHv)| 

N2(t-  l,v) 
|r(v)| 


where  Nk(t,v)  is  the  number  of  v’s  neighbors  that  hold  opinion  k  at  time-step  t  for  k  =  1,2. 


2.2  Anti-voter  Model 


In  the  basic  voter  model,  it  is  assumed  that  people  tend  to  follow  their  neighbors’  majority 
opinion.  However,  since  it  is  a  common  phenomenon  that  there  are  always  people  who  do  not 
agree  with  the  majority  and  support  the  minority  opinion,  the  anti-voter  model  is  defined  and 
investigated  (Huber  and  Reinert,  2004;  Rollin,  2007;  Donnelly  and  Welsh,  1984;  Matloff, 
1977).  In  the  anti-voter  model,  the  opinion  evolution  process  is  replaced  as  follows:  At  each 
time-step  t,  each  node  v  picks  a  random  neighbor  u  and  changes  its  opinion  to  the  opposite 
of  the  opinion  that  u  holds  at  time-step  t—  1,  i.e.,  node  v  selects  its  opinion  according  to  the 
probability  distribution. 


P(f,(v)  =  I ) 


N2(t-  l,v) 
IHv)| 


P(ft(v)  =  2) 


JVi(f-l.v) 

inv)i 


We  note  that  each  individual  tends  to  adopt  the  minority  opinion  among  its  neighbors  in¬ 
stead. 
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3  Proposed  Model 

3.1  Value- weighted  Voter  Model 

We  extend  the  basic  voter  model  to  the  value-weighted  voter  (VwV)  model  for  our  purpose. 
In  the  VwV  model,  the  total  number  of  opinions  is  set  to  K  (>  2),  and  each  node  of  G 
is  endowed  with  (K  +  1)  states;  opinions  1,  •••,  K,  and  neutral  (i.e.,  no-opinion  state).  We 
consider  that  a  node  is  active  when  it  holds  an  opinion  k,  and  a  node  is  inactive  when  it  does 
not  have  any  opinion  (i.e.,  when  its  state  is  neutral).  We  assume  that  nodes  never  switch  their 
states  from  active  to  inactive.  In  order  to  discuss  the  competitive  diffusion  of  K  opinions, 
we  introduce  the  parameter  wk  (>  0)  for  each  opinion  k,  which  is  referred  to  as  the  opinion 
value  of  opinion  k.  In  the  same  way  as  the  basic  voter  model,  let  f,  :  V  — >  {0, 1,2,  ■  •  • ,  K\ 
denote  the  opinion  distribution  at  time  t,  where  opinion  0  denotes  the  neutral  state.  Here, 
ft  is  defined  for  any  non-negative  real  number  t  since  the  VwV  model  incorporates  time 
delay  in  an  asynchronous  way,  i.e.,  t  is  continuous.  For  any  t  >  0,  let  (fit(v)  denote  the  latest 
opinion  of  node  v  (before  time  t),  and  let  nk(t,  v)  denote  the  number  of  v’s  neighbors  that 
hold  opinion  k  as  the  latest  opinion  (before  time  t),  i.e., 

nk(t,v)  =  \(u  e  r(v);  ip,(u )  =  k}\. 


We  define  the  evolution  process  of  the  VwV  model.  At  the  initial  time  t  =  0,  each  opinion 
is  assigned  to  only  one  node  and  all  other  nodes  are  in  the  neutral  state.  1  Given  a  target  time 
T,  the  evolution  process  unfolds  in  the  following  steps: 

1.  Each  node  v  independently  decides  the  next  update  time  f  at  its  update  time  t  according 
to  some  probability  distribution  such  as  an  exponential  distribution  with  parameter  = 
1,  2  where  the  first  update  time  is  t  =  0  for  every  node. 

2.  At  update  time  t,  the  node  v  selects  its  opinion  according  to  the  probability  distribution, 

P(ft(v)  =  k)  =  pk(t,v,w),  (k=l,---,K),  (1) 


where  w  =  (wi,  •  •  ■ ,  wk)  and 


pk(t,v,w) 


wknk(t,v) 

— 7 - (k=l,—  ,K). 

Zj=iWjnj(t,v) 


(2) 


3.  The  process  is  repeated  from  the  initial  time  t  -  0  until  the  next  update-time  passes  a 
given  final-time  T . 

Note  that  the  basic  voter  model  with  K  opinions  is  derived  from  the  VwV  model  with  uni¬ 
form  opinion  values  w\  =  •  •  •  =  wk- 

1  This  may  look  a  rather  unnatural  assumption  because  it  is  unlikely  that  all  the  different  opinions  are 
initiated  at  the  same  time.  Since  each  opinion  is  initiated  by  a  single  person  and  the  goal  is  to  see  how  it  is 
propagated,  it  should  be  allowed  that  each  opinion  is  assigned  to  only  one  node  and  all  the  remaining  nodes 
are  in  neutral  states,  i.e.,  unaffected  by  any  opinion  yet.  We  could  have  changed  the  timing  of  each  opinion’s 
initial  utterance,  but  chose  the  simplest  case. 

2  This  assumes  that  the  average  delay  time  is  1 . 
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3.2  Value-weighted  Mixture  Voter  Model 


Since  the  anti-voter  model  aims  to  represent  the  phenomenon  that  people  tend  to  follow 
their  neighbors’  minority  opinion,  the  anti-voter  model  with  K  opinions  can  be  defined  by 
replacing  Eq.  (1)  of  the  VwV  model  with 


=  k) 


1 

K-  1 


f,- 


fik(.t,v) 
If=t  nj(t,v)j 


Therefore,  we  can  also  extend  the  anti-voter  model  with  K  opinions  to  the  value-weighted 
anti-voter  model  by  replacing  Eq.  (1)  with 


P(ft(v)  =  k) 


\-pk(t,v,w) 
K-  1 


(k= 


For  our  purpose,  we  extend  the  VwV  model  and  define  the  value-weighted  mixture  voter 
(VwMV)  model  by  replacing  Eq.  (1)  with 


P(ft(v)  =  k) 


(1  -av)pk(t,v,w)  +  av 


l~Pk(t,v,w ) 

K-  1 


(k= 


(3) 


where  av  is  a  parameter  with  0  <  av  <  1.  Note  that  each  individual  located  at  node  v  tends 
to  behave  like  a  majoritarian  if  the  value  of  av  is  small,  and  tends  to  behave  like  an  anti- 
majoritarian  if  the  value  of  av  is  large.  Therefore,  we  refer  to  orv  as  the  anti-majoritarian 
tendency  of  node  v. 


4  Behavior  Analysis 

In  what  follows,  we  first  mathematically  define  the  share  prediction  problem  in  Subsec¬ 
tion  4.1  and  explain  why  it  is  important  to  use  a  model  to  predict  the  future.  Then,  in 
Subsection  4.2  we  introduce  the  mean  field  approach  which  is  a  method  used  in  statisti¬ 
cal  physics  to  analyze  the  average  behavior  of  a  complex  dynamic  system.  We  first  apply 
this  theory  to  analyze  the  VwV  model  in  Subsection  4.3  and  discuss  its  asymptotic  behavior 
and  the  time  needed  to  reach  consensus.  We  then  apply  this  theory  to  analyze  the  VwMV 
model  in  Subsection  4.4  and  discuss  its  asymptotic  behavior  in  a  similar  way.  These  theoret¬ 
ical  analysis  sheds  a  light  on  the  opinion  formation  dynamics  and  makes  the  behavior  easy 
to  understand. 


4.1  Share  Prediction  Problem 

Based  on  our  opinion  dynamics  model  (the  VwMV  model),  we  investigate  the  problem 
of  predicting  how  large  a  share  each  opinion  will  have  at  a  future  target  time  T  when  the 
opinion  diffusion  is  observed  from  t  =  0  to  t  =  To  (<  T).  Let  Dt0  be  the  observed  opinion 
diffusion  data  in  time-interval  [0,  To],  where  2>/(|  consists  of  a  sequence  of  ( v,t,k )  such  that 
node  v  changed  its  opinion  to  opinion  k  at  time  t  for  0  <  t  <  To-  For  any  opinion  k,  let  hk{t) 
denote  its  population  at  time  t,  i.e., 


hit )  =  |{v  €  V;  f,(v)  =  k}\. 
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Fig.  1:  An  example  of  opinion  population  curves  in  the  Blog  network  for  K  —  3. 


Figure  1  shows  an  example  of  opinion  population  curves  h\(t),  /12(f),  hj,(t)  for  K  =  3  in 
the  Blog  network  (see  Section  6  below),  where  the  opinion  values  are  set  to  w\  =  1.5,  W2  = 
1 .0,  W3  =  1.1  and  anti-majoritarian  tendency  av  (v  e  V )  is  drawn  from  the  beta  distribution 
with  shape  parameters  a  =  1  and  b  =  99.  Here,  if  we  set  Tq  =  10  and  T  =  30,  we  are  able  to 
observe  Dio  and  thus  {hk(t);  0  <  t  <  10)  for  k  =  1,2,3  and  the  problem  is  to  predict  /?i (30), 
/i2(30),  /?3(30).  Note  that  although  the  opinion  dynamics  is  stochastic,  we  found  that  the 
variance  of  the  value  of  hk( 30)  (k  =  1,2,3)  is  relatively  small  for  To  =  10.  We  can  easily  see 
from  Figure  1  that  the  naive  time-series  analysis  method  or  a  simple  extrapolation  method 
does  not  work  well  for  this  prediction  problem.  Thus,  it  is  crucial  to  accurately  estimate 
the  values  of  the  parameters  of  the  VwMV  model  from  the  observed  opinion  diffusion  data 
(more  to  come  later  on  this). 

Since  the  VwMV  model  gives  a  stochastic  process,  we  introduce  the  expected  share 
gk(t)  of  each  opinion  k  at  time  t  by 


and  consider  the  problem  of  predicting  gk{t)  (k  =  1  ,■■■  ,K)  from  the  observed  data  T)j0, 
which  is  referred  to  as  the  share  prediction  problem.  Here,  (x)  denotes  the  expected  value 
of  a  random  variable  x.  For  solving  the  share  prediction  problem,  we  develop  a  method 
that  effectively  estimates  the  values  of  the  parameters  wk  (k  =  1 ,  ■  •  ■ ,  K)  and  av  (v  e  V)  from 
Dtq-  We  note  that  the  method  developed  can  also  apply  to  detecting  high  anti-majoritarian 
tendency  nodes  (i.e.,  anti-majoritarians)  from  the  observed  opinion  diffusion  data. 

4.2  Mean  Field  Approach 

Below,  we  theoretically  investigate  the  asymptotic  behavior  of  expected  share  gk(t)  (k  = 
1,---  ,K)  of  the  VwMV  model  for  a  sufficiently  large  t,  and  demonstrate  that  it  is  crucial 
to  accurately  estimate  the  values  of  the  parameters,  Wk,  (k  =  1  ,■■■  ,K)  and  a  which  is  the 
average  of  av  over  all  nodes  v  e  V. 

According  to  previous  work  in  statistical  physics,  (e.g.,  Sood  and  Redner  (2005)),  we 
employ  a  mean  field  approach.  We  first  consider  a  rate  equation, 


dgkit ) 


dt 


(1  -  gk(t))Pk(t)  -  gk(t)  (1  -  Pk(t)),  (k  =  1,  •  •  ■ ,  K), 


(4) 
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where  Pk(t)  denotes  the  probability  that  a  node  adopts  opinion  k  at  time  t.  Note  that  in  the 
right-hand  side  of  Eq.  (4),  gk(t)  is  regarded  as  the  probability  of  choosing  a  node  holding 
opinion  k  at  time  t.  Here,  we  assume  that  the  average  local  opinion  share, 


/  nk(t,v)  \ 

\Zj=i  nj(t,v)j 

in  the  neighborhood  of  a  node  v  can  be  approximated  by  the  expected  opinion  share  gk(t) 
of  the  whole  network  for  each  opinion  k.  This  assumption  does  not  hold  in  general  except 
that  the  network  is  a  complete  graph  where  every  node’s  neighbors  are  all  the  nodes  in  the 
graph,  which  is  not  the  case  here.  In  fact,  without  this  assumption,  we  cannot  apply  the 
mean  field  theory  and  analyze  the  average  behavior  of  opinion  dynamics.  Extent  to  which 
this  assumption  is  justified  must  await  experimental  evaluation  by  using  the  real  network 
structure.  As  shown  later,  the  assumption  turned  out  to  be  acceptable.  Under  this  assumption, 
we  obtain  the  following  approximation  from  Eq.  (3): 

Pk(t)  *  (1  -a)pk(t,w)  +  a  1  (k=l,--,K),  (5) 

where  a  is  the  average  value  of  anti-majoritarian  tendency  av,  (v  e  V),  and 


Pk(t,w) 


;,m(0  ,  (k=i,...,Ki 

Zj=  1  Wjgj(t) 


(6) 


Note  that  Eq.  (5)  is  exactly  satisfied  when  G  is  a  complete  network  and  the  anti-majoritarian 
tendency  is  node  independent,  i.e.,  av  =  a,  (Vv  e  V). 


4.3  Analysis  of  VwV  Model 

For  simplicity,  we  begin  with  the  analysis  of  the  VwV  model.  In  this  case,  note  that  av  =  0 
(v  e  V),  i.e.,  a  =  0. 


4.3.1  Share  Analysis 

We  analyze  the  behavior  of  expected  share  gt(t)  (k  =  1  ,■■  ■  ,K)  of  the  VwV  model  for  a 
sufficiently  large  t  according  to  the  above  mean  field  approach.  From  Eqs.  (4),  (5)  and  (6), 
we  have 


dgk(t) 

dt 


gk{t)wk 


Yjl=igk'(t)wk' 
gk(t)wk 

- gk(t )■ 


gk(t)wk 


Zf-=1  gk'{t)wkf 


Zf'  =  l  gk’(t)wk ' 


(V) 


Suppose  that  the  opinion  values  are  non-uniform,  and  let  k*  be  the  opinion  with  the 
highest  value  parameter  such  that  wk*  >  Wk  for  ad  the  other  opinion  k  (k  4  k*).  Here  note 
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that  Wk/wk *  <  1  for  k  +  k*.  Then,  we  can  obtain  the  following  inequality  from  Eq.  (7)  when 
gk(t)  >  0  for  all  k : 


dgk-{t)  =  gk*{f)wk* 
dt  Zf=i  gk(t)wk 
gk*{t)wk* 
Zf=l  gk(t)wk 


K 


YjSk(t) 

k=  1 


Wk_ 

Wk* 


=  0. 


Thus,  unless  gk *  (?)  =  0,  the  opinion  k*  is  expected  to  finally  prevail  the  others,  regardless  of 
its  current  share  since  the  function  gk*(t)  is  expected  to  increase  as  time  passes  until  each  of 
the  other  opinion  shares  becomes  0. 

On  the  other  hand,  suppose  that  the  opinion  values  are  uniform  (i.e.,  w\  =  •••  =  wk )• 
Then,  we  obtain  from  Eq.  (7)  that 


dgk(t) 

dt 


0,  (k=l,—  ,K). 


Thus,  if  there  exists  some  to  >  0  such  that  gi  (to)  =  •  ■  ■  =  gK(to)  =  1  /K,  then  gk(t)  =  1  /  W  (?  >  ?o) 
for  every  opinion  k.  This  implies  that  any  opinion  can  in  general  become  the  majority. 
Hence,  we  have  the  following  results: 

1.  When  the  opinion  values  are  uniform  (i.e.,  w i  =  =  wk),  any  opinion  can  become  a 

winner. 

2.  When  the  opinion  values  are  non-uniform,  the  opinion  k*  with  highest  opinion  value  is 
expected  to  finally  prevail  over  the  others,  that  is,  limr_»oo  gk*  (?)  =  1. 

These  results  suggest  that  it  is  crucially  important  to  accurately  estimate  the  opinion 
values  of  the  VwV  model  from  the  observed  data  f)j0,  3  and  imply  that  the  share  prediction 
problem  can  be  well-defined  only  when  the  opinion  values  are  non-uniform.  We  experimen¬ 
tally  confirmed  the  results  for  several  realistic  networks,  although  the  above  analysis  is  valid 
only  when  the  approximation  (see  Eq.  (5))  holds. 


4.3.2  Consensus  Time  Analysis 

We  further  analyze  the  consensus  time  of  the  VwV  model  by  using  the  above  mean  field 
approach  when  opinion  values  are  non-uniform.  For  simplicity,  we  assume  that  Wk  =  w  if 
k  +  k* ,  i.e..  the  opinion  values  of  the  other  opinions  are  the  same.  4  Let  r  be  the  ratio  of  the 
value  parameters  defined  by  r  =  w/wk*.  Then,  we  obtain  the  following  differential  equation 
for  gk»(t)  from  Eq.  (7): 

dgk*(t)  _  gk*(t) _ 

dt  r(\-gk*(t))+gk*(i)  8k 
=  (1  ~  r)gk*(t)(\  -  gk*(t)) 
f  +  ( 1  —  r)gk*  (?) 

From  this  differential  equation,  we  can  easily  derive  the  following  solution: 

T~  logfe*  (?))  -  — log(l  -  gk*  (?))  =  ?  +  c, 

_ 1  —  r  1  —  r 

3  If  the  goal  is  to  predict  which  opinion  wins  eventually,  it  is  sufficient  to  identify  which  opinion  has  the 
highest  value,  but  if  we  want  to  estimate  the  share  of  each  opinion,  we  need  to  estimate  the  values  accurately. 

4  This  makes  the  analysis  drastically  simpler,  but  the  results  remains  valid  qualitatively. 
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where  C  stands  for  a  constant  of  integration.  Figure  2  shows  examples  of  expected  share 
curves  based  on  the  above  solution  with  different  ratios  of  the  opinion  values,  where  the 
ratio  r  is  set  to  r  =  1  -2~d  ( d  =  1,2, 3, 4,5),  and  each  curve  is  plotted  from  t  =  0  by  assuming 
gk*(0)  =  0.01  until  t  =  T  that  satisfies  gk*(T)  =  0.99.  From  Figure  2,  we  can  see  that  the 
consensus  time  is  quite  short  when  the  ratio  r  is  small,  while  it  takes  somewhat  longer  when 
the  ratio  r  approaches  to  1.  More  importantly,  this  result  indicates  that  the  consensus  time 
of  the  VwV  model  is  extremely  short  even  when  the  ratio  r  is  close  to  1,  compared  with  the 
basic  voter  model  studied  in  previous  work  (e.g.,  Even-Dar  and  Shapira  (2007)).  5  Therefore, 
we  consider  that  voter  model  can  become  more  practical  by  introducing  the  opinion  values. 


o 


o 


50  100  150  200  250  300 

time 


Fig.  2:  Examples  of  expected  share  curves. 


4.4  Analysis  of  VwMV  model 

Next,  we  analyze  the  behavior  of  expected  share  g^(t)  (k  =  l,  -  ,K)  of  the  VwMV  model 
for  a  sufficiently  large  t  according  to  the  above  mean  field  approach. 

4.4.1  Case  of  uniform  opinion  values: 

We  suppose  that  w\  =  ■■■  =  wk-  Then,  since  YJk=i Sk(f)  =  1»  from  Eq.  (6),  we  obtain 


pk(t,w)  =  gk{t),  (k=  !,■■■  ,K). 


Thus,  we  can  easily  derive  from  Eqs.  (4)  and  (5)  that 


Flence,  we  have 


limgk(t)  =  l/K,  (k=l,---,K). 

t—>  OO 


5  Their  results  is  that  the  basic  voter  model  converges  after  0(/i3  log«)  steps  with  probability  1-  o(l)  where 
n  is  the  number  of  nodes. 
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4.4.2  Case  of  non-uniform  opinion  values: 


We  assume  that  the  opinion  values  are  non-uniform.  We  parameterize  the  non-uniformity  by 
the  ratio. 


Sk  = 


Wk 

Zj=i  wj/K 


(l k  =  !,•••  ,K). 


Let  k*  be  the  opinion  with  the  highest  opinion  value.  Note  that  .s>*  >  1.  We  assume  as  before 
for  simplicity  that 

Wk  =  w(<Wk*)  if  k  £  k* , 


where  w  is  a  positive  constant.  We  also  assume  that  there  exists  some  ?o  >  0  such  that 


gi(.to)  =  ---=gK(.to)=  1  IK. 


We  can  see  from  the  symmetry  of  the  setting  that  gk{t)  =  geif),  ( t  >  to)  if  k ,  £  4  k*  .  This 
implies  that  opinion  k*  is  the  winner  at  time  t  if  and  only  if  gk*(t)  >  1  /K.  Then,  from  Eqs.  (4) 
and  (6),  we  obtain 


dgkft) 


clt 


f=r0 


Thus  we  have  from  Eq.  (5)  that 

dgkft) 


dt 


t=tQ 


Sk*  -  1 

K-  1 


Therefore,  we  obtain  the  following  results: 

1.  When  a  <  l-l/K, 


Pk*(to,w )  = 


Sk *_ 

K  ' 


gk*(t)  >  l/K,  {t>tQ), 

that  is,  opinion  k*  is  expected  to  spread  most  widely  and  become  the  majority. 

2.  When  a  =  l-l/K, 


gk(t)=l/K, 


for  any  opinion  k,  that  is,  any  opinion  can  become  a  winner. 

3.  When  a  >  l-l/K , 

gk*(t)<l/K,  (t  >  t0), 

that  is,  opinion  k*  is  expected  to  spread  least  widely  and  become  the  minority. 


4.4.3  Experiments: 

The  above  theoretical  results  are  justified  only  when  the  approximation  (see  Eq.  (5))  holds, 
which  is  always  true  in  the  case  of  complete  networks.  Real  social  networks  are  much  more 
sparse  and  thus,  we  need  to  verify  the  extent  to  which  the  above  results  are  true  for  real 
networks.  We  experimentally  confirmed  the  above  theoretical  results  for  several  real-world 
networks.  Here,  we  present  the  experimental  results  for  K  =  3  in  the  Blog  network  (see 
Section  6),  where  the  opinion  values  are  w \  =2,ws  =  Wi  =  1,  and  anti-majoritarian  tendency 
orv,  (v  e  V )  is  drawn  from  the  beta  distribution  with  certain  combinations  of  shape  parameters 
a  and  b.  Figure  3  shows  the  results  of  opinion  share  curves,  t  h->  hk(t)  j h j(t) ,  (k  =  1,2,3), 
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Fig.  3:  Results  of  the  opinion  share  curves  for  different  distributions  of  anti-majoritarian  tendency  in  the  Blog 
network. 


when  the  distribution  of  anti-majoritarian  tendency  changes,  where  each  node  adopted  one 
of  three  opinions  with  equal  probability  at  time  t  =  0.  Note  that 

a  =  0.33  (<  1-1/3),  if  a  =  2,  b  =  4, 
a  =1-1/3,  if  a  =  4,6  =  2, 

a  =  0.9  (>  1-1/3),  if  a  =18,6  =  2. 

We  obtained  similar  results  to  those  in  Figure  3  also  for  many  other  trials.  These  results 
support  the  validity  of  our  theoretical  analysis. 


5  Learning  Method 

In  this  section  we  describe  a  method  for  estimating  parameter  values  of  the  VwMV  model 
from  a  given  observed  opinion  spreading  data  Dtq-  Based  on  the  evolution  process  of  our 
model  (see  Eq.  (3)),  we  can  obtain  the  likelihood  function, 

n  p(Mv)=k ) 

^v,t,k)eOTo 


£(DTo;w,a)  =  log 


(8) 
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where  w  stands  for  the  A'-dimensional  vector  of  opinion  values,  i.e.,  w  =  ,wk ),  and 

a  is  the  |  V| -dimensional  vector  with  each  element  av  being  the  anti-majoritarian  tendency 
of  node  v.  Thus  our  estimation  problem  is  formulated  as  a  maximization  problem  of  the 
objective  function  £(DT0'.w,a)  with  respect  to  w  and  a.  Note  from  Eqs.  (2),  (3)  and  (8) 
that  £,(DT0\cw,a)  =  c£{T)t0',w  ,a)  for  any  c  >  0.  Note  also  that  each  opinion  value  w*  is 
positive.  Thus,  we  transform  the  parameter  vector  w  by  w  =  trO),  where 

w(x)  =  ,eXK~] ,  I ),  (9) 

Namely,  our  problem  is  to  estimate  the  values  of  x  and  a  that  maximize  £,(DT0'.w(x),a). 

We  derive  an  iterative  algorithm  for  obtaining  the  maximum  likelihood  estimators.  To 
this  purpose,  we  introduce  the  following  parameters  that  depend  on  a:  For  any  v  e  V  and 


fiv,k,j(a) 


1  -  tty  if  j  =  k , 

Qrv/(^T—  1)  if  j  +  k. 


(10) 


Then,  from  the  definition  of  P(ft(v)  =  k)  (see  Eq.  (3)),  by  noting  \  - pfc(t,v,w)  =  'hjtk Pj(t>  v,w), 
we  can  express  Eq.  (8)  as  follows: 

k 

£{DTo-,w(x),a)  =  Yj  loS  ^At,,(«)  Pj(t,v,w(x))  . 

(v,l,k)eD  t0  U=1 


Now,  let  z  and  a  be  the  current  estimates  of  x  and  a,  respectively.  Then,  we  define  qv  t  ^  j(x,  a) 
by 


PvAj(a)  Pj(t9v,w(x)) 
Z,=  |  Aa.iltt)  Pi(t,v,w(x))’ 


{veV,0<t<TQ,k,j  =  !,••• ,  K),  and  transform  our  objective  function  as  follows: 


£(DTQ',w(x),a)  =  Q(x,a-,x,a)-'H(x,a;x,a),  (11) 

where  <3(jc,  or;  jc,  or)  is  defined  by 

Q(x,a;x,a)  =  (2i(;c;;c,Q')  +  <32(a';i,d')>  (12) 

_  K 

<3i(jr;i,a)  =  Y  Yq^t,k,Ax,Sc)  \ogpj(t,v,w{x)),  (13) 

(v,t,k)e£>T0  7=1 
K 

Q2(a\x,a)  =  Y  ^  «)  logA^/ar),  (14) 

(v,t,k)eOTg  j=  1 


and  “H(jc,  a;  x,  a)  is  defined  by 


K 

9i(x,a;x,a)  =  Y  Yq'’-a'-j(*’^  loS9v./.kj(x,a). 

(v,t,k)eOTg  7=1 

Since  rH(x,a;x,a)  is  maximized  at  x  =  x  and  a  =  or,  we  can  increase  the  value  of  £(!DT0'.w(x),a) 
by  maximizing  (3(j:,a';jc,Q')  with  respect  to  jc  and  a  (see  Eq.  (11)).  From  Eq.  (12),  we  can 
maximize  Q(x,a;x,a)  by  independently  maximizing  (3|  (x;x,a)  and  Q2(a;x,a)  with  re¬ 
spect  to  x  and  a,  respectively. 
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First,  we  estimate  the  value  of  x  that  maximizes  (3|  (x;x,  or).  Here,  note  fromEqs.(2)  and 
(9)  that  for  j  =  1,-  ,K  and  A  =!,•••  ,K-  1, 


dpj(t,v,w{x )) 
dxA 


=  6jj pj(t,v,w(x))  -  pj(t,v,w(x))pA{t,v,w{x)), 


where  5p i  is  Kronecker’s  delta.  From  Eqs.  (13)  and  (15),  we  have 

K 


dQ\(x\x,a ) 

dxA 


=  ^  ^  <7v,r,ArjC*3 Or)  (<^ j,A  -  Px{t,  v,  M'(X)))  , 


(15) 


(16) 


for  A  =  1 ,  •  •  • ,  K  —  1 .  Moreover,  from  Eqs.  (15)  and  ( 1 6),  we  have 

qv,t,k,j(x •  O')  (paU,  v,  w(x))  p^(t,  V,  w(x))  -  5a,m  pA(t,  v,  w  (x))) , 

for  A,p  =  l,---  ,K-  1.  Thus,  the  Hessian  matrix  (d1Q\(x\x,a)ldxAdxll)  is  negative  semi- 
definite  since 


()-Q\ (x;x,ar) 
dxA  dxu 


-  E  E 


— 1  d2Q\(x\x,a) 


E 

/i,/l=l 


dxA  dx. 


-y^n 


-  I  I 


(K~  1 


(v,t,k)eDr0  7=1 


<0, 


zi=l 
/sT— 1 


1 


^  pA(t,v,w(x))yA  -  ^  pA(t, v,w(x))yA 


,t=i 


_  JJT-1 

^  pA(t, v, tv(x))  ^  p^{t,v,w{x))yn 

V  >u=l 


,1=1 
(  K- 1 


_  'if*'-1 

1  -  ^  /u(f,v,w(*))  ^  PM(t,v,w(x))y, 

VjU=1 


i=l 


(17) 


for  any  (yi,---  ,vx-i)  6  RA  1 .  Hence,  by  solving  the  equations  dQi(x',x,a)/dxA  =  0,  (A  = 
1.-  ■  ■  ,K—  1)  (see  Eq.  (16)),  we  can  find  the  value  of  x  that  maximizes  <2i(x;x,dr).  We  em¬ 
ployed  a  standard  Newton  Method  in  our  experiments. 

Next,  we  estimate  the  value  of  a  that  maximizes  (22(ar;x, a).  From  Eqs.  (10)  and  (14), 
we  have 

<32(o';  x,  or)  =  [qVM(x,  a)  log(l  -  ay)  +  ( 1  -  qv,tpk(x,  a))  log  ( . 

(v,t,k)eDTg 


Note  that  (S^o^x,  or)  is  also  a  convex  function  of  a.  Therefore,  we  obtain  the  unique  solution 
a  that  maximizes  <2(x,or;x,or)  as  follows: 


1 


Cty  — 


|£>7b(v)|  , 


^  (1  ~qv,  t,kjc(x,  or)), 


for  each  v  e  V,  where  Dj0(v )  =  {(f,k);  ( v,t,k )  e  Dr0}- 
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When  av  =  0  for  any  v  G  V,  the  VwMV  model  is  reduced  to  the  VwV  model.  Thus,  a 
straightforward  application  of  the  above  learning  algorithm  for  the  VwMV  model  gives  the 
learning  algorithm  for  the  VwV  model.  Note  here  that  for  the  VwV  model,  the  objective 
function  becomes 

£(£)r0;w(jr),0)  =  ^  logpk(t,v,w(x)), 

(v,a-)e»r0 

(see  Eqs.  (1)  and  (8)),  and  its  second  derivatives  become 

[pxit,v,w{x))  Pn(t,v,w{x))  -  6, ijj  px(t,v,w(x))) , 

(A,p=  l,---  ,  K—  1).  In  a  similar  way  to  Eq.  (17),  we  can  easily  prove  that  the  Hessian  matrix 
(d2 £,(DTo',w(x),0)/dx/\dXn)  is  negative  semi-definite.  Therefore,  we  can  guarantee  that  the 
optimal  solution  of  the  objective  function  is  global  optimal  for  the  VwV  model.  Here,  we 
mention  that  although  it  is  not  guaranteed  that  the  optimal  solution  of  the  objective  function 
of  the  VwMV  model  is  global  optimal,  their  estimated  parameter  values  converged  very 
closely  to  their  true  values  in  our  experiments  when  there  is  an  enough  amount  of  training 
data. 


&£Wt0-Mx),0) 
dx,i  dxn 


-  E 

(v,t,k)eDTn 


6  Experimental  Evaluation 

Using  large  real  networks,  we  experimentally  investigate  the  capability  of  the  proposed 
model  and  the  performance  of  the  proposed  learning  method.  We  first  show  the  results  of 
the  accuracies  of  predicting  future  opinion  shares.  We  then  show  the  results  of  the  estima¬ 
tion  error  of  anti-majoritarian  tendency,  and  the  accuracies  of  detecting  nodes  with  high 
anti-majoritarian  tendency  (i.e.,  anti-majoritarians). 


6.1  Experimental  Settings 

We  employed  four  datasets  of  large  real  networks,  which  are  all  bidirectionally  connected 
networks  6  and  exhibit  many  of  the  key  features  of  social  networks.  7  The  first  one  is  a 
trackback  network  of  Japanese  blogs  (Kimura  et  al,  2009)  that  has  12,047  nodes  and  79,920 
directed  links  (the  Blog  network).  The  second  one  is  a  Coauthor  network  (Palla  et  al,  2005) 
and  has  12,357  nodes  and  38,896  directed  links  (the  Coauthor  network).  The  third  one  is 
a  network  derived  from  the  Enron  Email  Dataset  (Klimt  and  Yang,  2004)  by  extracting  the 
senders  and  the  recipients  and  linking  those  that  had  bidirectional  communications.  It  has 
4,254  nodes  and  44,314  directed  links  (the  Enron  network).  The  last  one  is  a  network  of 
people  that  was  derived  from  the  “list  of  people”  within  Japanese  Wikipedia  (Kimura  et  al, 
2009),  which  has  9,481  nodes  and  245,044  directed  links  (the  Wikipedia  network).  Just  to 
provide  a  sense  of  how  fast  the  opinion  can  propagate,  the  average  shortest  path  of  each 
network  is  given  here:  8.175  for  the  Blog  network,  8.160  for  the  Coauthor  network,  3.726 
for  the  Enron  network  and  4.700  for  the  Wikipedia  network. 

6  Opinion  propagation  is  directional.  Choosing  bidirectional  networks  means  that  opinion  can  propagate 
in  both  directions. 

7  It  would  be  the  best  if  we  can  use  the  real  opinion  propagation  data.  However,  as  we  are  not  able  to  find 
such  data,  the  next  best  is  to  use  the  network  structures  constructed  from  the  real  world  social  media  data  (not 
synthetic  networks). 
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To  do  experiments,  we  have  to  first  determine  the  values  of  parameters:  the  number 
of  opinions  K ,  the  true  value  of  each  opinion  w*  the  true  value  of  each  anti-majoritarian 
tendency  a*,  (v  e  V).  We  varied  K  =  2,3,-  •• ,  10,  and  chose  w*k  from  the  interval  [0.5, 1.5] 
uniformly  at  random  and  a*  by  drawing  it  from  the  beta  distribution  with  the  shape  parame¬ 
ters  a  and  b.  We  chose  the  beta  distribution  simply  because  of  the  easiness  of  controlling  the 
average  and  variance  of  the  distribution.  As  implied  in  Subsection  3.1,  we  used  the  expo¬ 
nential  distribution  with  rjv  =  1  to  determine  the  opinion  update  time.  Which  nodes  to  start 
from  is  another  problem.  As  explained  also  in  Subsection  3.1  we  assigned  each  opinion  to 
only  one  node  initially  and  all  other  nodes  were  set  in  the  neutral  state.  Those  initially  as¬ 
signed  K  nodes  are  taken  from  the  top  K  nodes  with  respect  to  the  node  degree  ranking.  We 
start  simulating  the  opinion  propagation  process  from  these  K  nodes  using  the  parameter 
values  which  are  assumed  true,  and  generated  £>t0-  As  for  our  learning  settings,  we  set  the 
initial  value  of  each  value  parameter  to  \\>k  =  1,  and  the  initial  value  of  each  anti-majoritarian 
tendency  to  av  =  0.5,  (v  e  V).  We  terminated  the  learning  iteration  when  the  increase  of  our 
objective  function  becomes  sufficiently  small,  i.e., 

£(DTo ; w, a)  - £(DTo ; w, a)  <  iq_8 

£(£>T0;w,a) 

where  w  and  a  mean  the  parameter  vectors  updated  from  w  and  a.  Note  that  our  learning 
algorithm  always  increases  our  objective  function  as  described  in  the  previous  section. 


6.2  Share  Prediction 

For  each  number  of  opinions  (k  =  2,3,  ■  ■  ■ ,  K)  we  predicted  the  expected  share  gk{T)  for 
the  observed  data  Dtq-  where  we  set  T  =  30,  and  investigated  the  cases  Tq  =  10, 15  and 
a  =  0.5,0.1,0.01  by  generating  av  with  {a,b)  =  (2,2),  (1,9), (1,99),  respectively.  As  we  men¬ 
tioned  in  Section  1  we  think  it  is  important  to  learn  the  model  using  a  small  amount  of 
data  and  predict  the  near  future.  Since  the  average  shortest  path  of  each  network  is  less 
than  10,  Tq  =  10  is  the  minimum  training  time  required  to  learn  the  parameters  for  all  the 
nodes.  Note  that  a  means  the  average  anti-majoritarian  tendency,  which  is  given  by  a/(a  +  b). 
Namely,  after  we  have  estimated  the  values  of  each  w/,  and  each  av,  we  predicted  the  value 
of  gk(T)  by  simulating  the  model  M  times  from  £>r0  and  taking  their  average,  where  we 
used  M  =  100.  In  fact,  our  preliminary  experiments  indicate  that  the  results  for  M  =  100  are 
not  much  different  from  those  for  M  =  1,000  and  10,000  in  the  networks  we  used. 

In  order  to  investigate  the  importance  of  introducing  the  anti-majoritarian  tendency  of 
each  node,  we  compared  the  proposed  method  with  the  VwV  model  which  has  no  anti- 
majoritarian  component.  Moreover,  in  order  to  investigate  the  importance  of  introducing  the 
opinion  values,  we  also  compared  the  proposed  method  with  the  same  VwMV  model  in 
which  the  opinion  values  are  constrained  to  take  a  uniform  value  and  the  anti-majoritarian 
tendency  of  each  node  is  the  only  parameter  to  be  estimated.  We  refer  to  this  method  as 
the  uniform  value  method.  Furthermore,  given  the  observed  data  £)t0,  we  can  simply  apply 
a  polynomial  extrapolation  for  predicting  the  expected  share  of  opinion  k  at  a  target  time 
T,  since  we  can  naively  speculate  that  the  recent  trend  for  each  opinion  captured  by  the 
polynomial  function  approximation  continues.  Thus,  we  consider  predicting  the  values  of 
gi(T),  ■■■,  gK(T),  by  estimating  the  value  of  the  population  hk(T)  of  opinion  k  at  time  T 
based  on  the  polynomial  function  of  degree  L  that  interpolates  the  L+  1  data  points  [(Tq  - 
A  +  £A/L.hk(To~A  +  {A/L));{  =  0, 1,  -  ■  •  ,L|,  where  A  is  the  parameter  with  0  <  A  <  Tq.  We 
refer  to  this  prediction  method  as  the  polynomial  extrapolation  method.  In  our  experiments, 
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Table  1:  Results  of  opinion  share  prediction  for  the  Blog  network  (To  =  10,  a  =  0.1,  K  =  10).  Note  that  the 
two-side  0.05  point  of  the  /-distribution  with  9  degrees  of  freedom  is  2.262. 


Method 

Average  of  error  Sg 

/-value  Tgc 

proposed 

0.0396 

— 

VwV 

0.4520 

15.4645 

uniform  value 

0.5172 

13.4893 

linear  (A  =  1) 

0.4996 

12.5024 

linear  (A  =  3) 

0.4243 

12.0177 

linear  (A  =  5) 

0.3247 

14.4648 

quadratic  (A  =  1) 

1.0845 

13.2210 

quadratic  (A  =  3) 

1.2795 

26.9768 

quadratic  (A  =  5) 

1.3296 

15.6869 

cubic  (A  =  1) 

1.3710 

18.3478 

cubic  (A  =  3) 

1.1799 

12.5790 

cubic  (A  =  5) 

1.1219 

16.7674 

quartic  (A  =  1) 

1.1963 

19.0506 

quartic  (A  =  3) 

1.1079 

16.2728 

quartic  (A  =  5) 

1.0956 

11.9049 

Table  2:  Results  of  opinion  share  prediction  for  the  Coauthor  network  (To  =  10,  a  =  0.1,  K  =  10).  Note  that 
the  two-side  0.05  point  of  the  /-distribution  with  9  degrees  of  freedom  is  2.262. 


Method 

Average  of  error  Sg 

/-value  TJfc 

proposed 

0.0590 

— 

VwV 

0.4634 

15.9587 

uniform  value 

0.4422 

13.5598 

linear  (A  =  1) 

0.4193 

14.1568 

linear  (A  =  3) 

0.2814 

10.2062 

linear  (A  =  5) 

0.2097 

9.5952 

quadratic  (A  =  1) 

1.0794 

17.6792 

quadratic  (A  =  3) 

1.2158 

12.9942 

quadratic  (A  =  5) 

1.6140 

22.1016 

cubic  (A  =  1) 

1.1616 

16.4268 

cubic  (A  =  3) 

1.1615 

17.8509 

cubic  (A  =  5) 

0.9575 

18.6748 

quartic  (A  =  1) 

1.1852 

14.3082 

quartic  (A  =  3) 

1.0971 

14.1797 

quartic  (A  =  5) 

1.1889 

17.3193 

we  adopted  L  =  1,2,3, 4,  i.e.,  the  linear,  quadratic,  cubic,  and  quartic  polynomial  functions, 
and  examined  A  =  1,  A  =  3,  and  A  =  5.  We  evaluated  the  effectiveness  of  the  proposed  share 
prediction  method  by  comparing  it  with  the  above  six  methods  (VwV,  uniform  and  four 
polynomial). 

Let  ~gk{T)  be  the  estimate  of  gk(T)  by  a  share  prediction  method.  We  measured  the 
performance  of  the  share  prediction  method  by  the  prediction  error  Sg  defined  by  8 

K 

&g  =  Yjfek(T)-gk(T)\. 
k=  1 


8  It  may  sound  more  reasonable  to  weight  each  difference  by  the  share  itself,  but  we  decided  not  to  do  so. 
We  rather  considered  the  prediction  problem  as  the  classification  problem. 
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Table  3:  Results  of  opinion  share  prediction  for  the  Enron  network  (Tq  =  10,  a  =  0.1,  K  =  10).  Note  that  the 
two-side  0.05  point  of  the  /-distribution  with  9  degrees  of  freedom  is  2.262. 


Method 

Average  of  error  Sg 

/-value  Tgc 

proposed 

0.0731 

— 

VwV 

0.6030 

12.7367 

uniform  value 

0.608B 

20.4684 

linear  (A  =  1) 

0.6909 

8.4882 

linear  (A  =  3) 

0.6511 

11.2945 

linear  (A  =  5) 

0.5577 

15.1556 

quadratic  (A  =  1) 

1.1341 

11.4784 

quadratic  (A  =  3) 

1.0765 

13.9631 

quadratic  (A  =  5) 

1.1763 

14.7290 

cubic  (A  =  1) 

1.2378 

15.1644 

cubic  (A  =  3) 

1.1378 

16.2330 

cubic  (A  =  5) 

1.2605 

16.6612 

quartic  (A  =  1) 

1.1699 

11.4147 

quartic  (A  =  3) 

1.3411 

32.1862 

quartic  (A  =  5) 

1.1910 

15.0670 

Table  4:  Results  of  opinion  share  prediction  for  the  Wikipedia  network  (Tq  =  10,  a  =  0.1,  K  =  10).  Note  that 
the  two-side  0.05  point  of  the  /-distribution  with  9  degrees  of  freedom  is  2.262. 


Method 

Average  of  error  Sg 

/-value  Tgc 

proposed 

0.0390 

— 

VwV 

0.4429 

12.8927 

uniform  value 

0.6000 

11.6327 

linear  (A  =  1) 

0.5151 

13.2910 

linear  (A  =  3) 

0.4073 

12.2377 

linear  (A  =  5) 

0.3968 

14.8808 

quadratic  (A  =  1) 

1.1122 

12.8117 

quadratic  (A  =  3) 

1.1521 

15.0864 

quadratic  (A  =  5) 

1.1674 

16.0370 

cubic  (A  =  1) 

1.2193 

13.7714 

cubic  (A  =  3) 

1.1950 

16.4728 

cubic  (A  =  5) 

1.0156 

16.7386 

quartic  (A  =  1) 

1.0679 

12.1467 

quartic  (A  =  3) 

1.2045 

18.3987 

quartic  (A  =  5) 

1.3886 

27.4023 

We  first  examined  the  case  of  Tq  =  10,  a  =  0.1  and  K  =  10.  Tables  1,  2,  3  and  4  are  the 
results  of  opinion  share  prediction  for  the  Blog,  the  Coauthor,  the  Enron  and  the  Wikipedia 
networks,  respectively.  We  conducted  10  trials  varying  the  true  values  of  value  parameters 
for  each  K,  and  the  second  column  in  Tables  1,  2,  3  and  4  indicates  the  average  of  Sg  over  the 
10  trials.  In  order  to  investigate  whether  the  difference  of  the  prediction  error  Sg  between 
the  proposed  method  and  each  of  the  other  methods  used  for  comparison  is  statistically 
significant  or  not,  we  performed  a  /-test.  Let  £^  and  £^  denote  the  values  of  Sg  for  the 
proposed  method  and  the  compared  method,  respectively.  We  calculated  /- value 


T, 


PC 
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number  of  opinions 

(a)  T0  =  10,  a  =  0.5. 


number  of  opinions 

(b)  T0  =  10,  a  =  0.1. 


Fig.  4:  Results  of  opinion  share  prediction  for  the  Blog  network. 


where  mean(x)  and  std(x)  denote  the  standard  average  and  the  sample  standard  deviation  of 
sample  x,  respectively.  In  Tables  1,  2,  3  and  4,  the  third  column  indicates  the  f-value  Tgc  ■ 
Here,  note  that  the  two-side  0.05  point  of  the  t-distribution  with  9  degrees  of  freedom  is 
r*0Q5  =  2.262.  Thus,  we  see  that  in  the  case  of  To  =  10,  a  =  0.1  and  K  =  10,  the  difference 
between  the  proposed  method  and  each  of  the  compared  methods  in  prediction  error  Sg  is 
statistically  significant  by  the  f-test  at  significance  level  0.05.  Moreover,  from  Tables  1,  2, 
3  and  4,  we  see  that  the  linear  extrapolation  method  performed  best  among  the  polynomial 
extrapolation  methods  in  the  case  of  Tq  =  10,  a  =  0.1  and  K  =  10.  We  obtained  the  same 
results  for  the  other  cases  with  different  combinations  of  Tq,  a  and  K.  Thus,  we  show  only 
the  results  of  the  linear  extrapolation  method  for  the  polynomial  extrapolation  method. 

Figure  4  is  the  results  for  the  Blog  network,  where  circles,  diamonds  and  upward  tri¬ 
angles  indicate  the  prediction  errors  of  the  proposed  method,  the  VwV  method,  and  the 
uniform  value  method,  respectively,  and  downward  triangles,  squares,  and  crosses  indicate 
those  of  the  linear  extrapolation  method  adopting  A  =  1,  A  =  3,  and  A  =  5,  respectively. 
Figure  4  (a),  (b),  (c)  and  (d)  are  the  results  for  (To, O')  =  (10,0.5),  (10,0.1),  (10,0.01),  and 
(15,0.01),  respectively.  Figures  5,  6,  and  7  are  the  results  for  the  other  three  networks,  i.e., 
the  Coauthor  network,  the  Enron  network,  and  the  Wikipedia  network,  respectively. 

From  these  figures,  we  see  that  the  proposed  method  worked  substantially  better  than 
the  other  methods.  More  specifically,  the  VwV  method  worked  poorly  when  values  for 
the  anti-majoritarian  tendency  were  relatively  large.  Conversely,  the  uniform  value  method 
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Fig.  5:  Results  of  opinion  share  prediction  for  the  Coauthor  network 


worked  poorly  when  they  were  relatively  small.  These  results  are  predictable  because  the 
VwV  method  cannot  cope  with  the  effect  of  the  anti-majoritarian  tendency  and  the  uniform 
value  method  cannot  cope  with  the  effect  of  opinion  value.  We  further  see  that  the  proposed 
method  significantly  outperformed  the  polynomial  extrapolation  method  in  every  case.  Es¬ 
pecially,  we  observed  that  the  proposed  method  accurately  predicted  the  share  at  T  even  in 
the  case  that  the  share  ranking  at  To  got  reversed  at  the  target  time  T  as  shown  in  Figure  1. 
This  is  attributed  to  the  use  of  the  estimated  value  parameters  which  take  different  values 
for  different  opinions,  and  is  consistent  with  the  results  of  the  mean  field  analysis.  We  also 
observe  that  compared  with  cases  of  a  =  0.5  and  a  =  0.1,  the  performance  of  the  proposed 
method  in  case  of  a  =  0.01  becomes  worse  for  To  =  10.  This  is  because  the  opinion  change 
driven  by  the  anti-majoritarians  is  smaller  when  a  is  smaller,  thereby  providing  less  effec¬ 
tive  training  data  for  learning  a.  Larger  error  for  a  negatively  affects  the  results  of  share 
prediction  despite  the  effect  of  anti-majoritarians  is  less.  However,  it  becomes  better  and 
comparable  to  the  other  cases  for  Tq  =  15  as  expected  since  the  amount  of  training  data 
increases. 

During  the  experiments  we  noticed  that  the  time  needed  to  reach  the  consensus  gets 
longer  when  the  difference  between  the  largest  and  the  second  largest  values  of  the  opinion 
value  parameters  is  small.  This  can  also  be  predicted  by  the  consensus  time  analysis,  i.e., 
considering  the  case  where  the  highest  two  values  are  the  same  and  the  rest  are  also  the 


same. 
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Fig.  6:  Results  of  opinion  share  prediction  for  the  Enron  network 


In  this  subsection,  we  focused  only  on  the  accuracy  of  share  prediction  and  did  not 
discuss  the  accuracy  of  parameter  learning.  As  conjectured  in  Section  1,  learning  the  opinion 
values  is  easy  and  learning  the  anti-majoritarian  tendency  is  hard.  Indeed,  all  the  opinion 
values  can  be  estimated  in  good  accuracy.  The  average  error  was  6%  even  using  a  training 
data  for  such  a  short  period  of  time.  However,  as  predicted,  the  average  error  of  the  estimated 
anti-majoritarian  tendency  is  large.  For  example,  in  the  case  of  To  =  10,  a  =  0.5  and  K  =  10, 
the  average  value  of  error  Sa  was  more  than  0.17  for  all  the  four  networks.  Namely,  the 
estimation  error  of  anti-majoritarian  tendency  for  each  node  was  more  than  (0.17/0.5)* 
100  =  34%  on  the  average.  This  is  because  the  number  of  parameters  is  the  same  as  the 
number  of  nodes  which  is  very  large.  Nevertheless,  the  accuracy  of  share  prediction  is  very 
good.  For  example,  in  the  case  of  Tq  =  1 0,  a  =  0.5  and  K  =  10,  the  average  value  of  error 
Sg  was  less  than  0.026  for  all  the  four  networks.  Namely,  the  share  prediction  error  for  each 
opinion  was  less  than  ((0.026/ 10)/(1  / 10))  *  100  =  2.6%  on  the  average.  This  looks  strange 
at  a  glance,  but  we  can  explain  the  reason  as  follows.  We  started  with  the  K  distinct  initial 
nodes  and  all  the  other  nodes  were  neutral  in  the  beginning.  Recall  that  we  set  the  average 
time  delay  to  1.0,  which  means  that  on  the  average  each  node  updates  its  opinion  every 
single  time  unit.  Thus  when  Tq  =  10  the  opinion  updates  can  propagate  10  steps  on  the 
average.  As  explained  in  Subsection  6.2,  considering  that  the  average  shortest  path  of  the 
network  is  less  than  10  for  all  the  networks,  opinion  update  takes  place  barely  almost  all  the 
nodes.  For  some  nodes  the  number  of  updates  is  10  and  for  other  nodes  it  is  1.  The  accuracy 
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Fig.  7:  Results  of  opinion  share  prediction  for  the  Wikipedia  network 


estimation  error 


Fig.  8:  Distribution  of  estimation  error  for  anti-majority  tendency  of  each  node  in  the  Blog  network  (To  =  10, 
a  =  0.5,  K  =  10). 


of  the  anti-majoritarian  tendency  for  these  nodes  where  the  opinion  updates  are  very  few  is 
indeed  very  bad  (no  valid  learning  took  place),  but  the  accuracy  for  the  nodes  that  undergo 
several  opinion  updates  is  good.  The  variance  of  the  node-wise  accuracy  is  large.  Figure  8 
which  is  the  cumulative  error  probability  P(\av  -  a* \  >  x )  in  case  of  Tq  =  10,  a  =  0.5  and 
K  =  10  for  the  Blog  network  clearly  indicates  this,  where  each  a*  and  ay  denote  the  true 
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and  the  estimated  anti-majoritarian  tendencies  of  node  v,  respectively.  The  average  error  is 
indeed  large  and  about  30%  of  nodes  have  errors  greater  than  50%.  However,  as  the  mean 
held  analysis  implies,  it  is  the  average  of  the  anti-majoritarian  tendency  that  matters,  as 
the  first  approximation,  as  far  as  the  opinion  share  is  concerned.  In  this  case,  we  can  verify 
that  Zvev  |%' -  «v|  IW\  =  0.181 1  and  |2vgyOrv/|V|  -Evevav/I^l|  =  0.0015.  The  latter  is  three 
orders  of  magnitude  less.  This  explains  the  good  accuracy  of  the  opinion  share  despite  the 
bad  accuracy  of  the  anti-majoritarian  tendency.  In  the  next  subsection  we  will  describe  the 
accuracy  of  the  anti-majoritarian  tendency  using  more  training  data. 

To  sum  up,  we  confirmed  that  the  results  of  our  theoretical  analyses  hold  in  these  real 
networks  and  that  the  proposed  method  outperforms  the  polynomial  extrapolation  method. 
On  the  average,  the  prediction  error  of  the  proposed  method  was  about  four  times  less  for  a 
given  T0.  Besides,  it  achieved  a  comparable  prediction  accuracy  with  the  observation  time 
three  times  less  compared  with  the  polynomial  extrapolation  method. 


6.3  Discovery  of  Anti-majority  Opinionists 

We  examined  the  accuracy  of  discovering  anti-majoritarian  opinionists  (and  majoritarian 
opinionists)  for  both  a  small  ( K  =  3)  and  a  large  (K  =  10)  K,  by  varying  7o  =  100,  200, 
1000.  The  error  is  measured  by  Sa , 


&a= i4xi|“v-Q^' 

1  1  veV 

We  also  measured  the  accuracies  of  detecting  the  high  and  the  low  anti-majoritarian  ten¬ 
dency  nodes  by  F-measures  Ta  and  Tn,  respectively.  Here,  Ta  and  Tn  are  defined  as  fol¬ 
lows: 

_  2|AnA*|  2|7Vn  7V*| 

J  A  —  ~  >  J  N  —  ~  > 

|A|  +  |A*|  \N\  + 1  A*  | 

where  A*  and  A  are  the  sets  of  the  true  and  the  estimated  top  15%  nodes  of  high  anti- 
majoritarian  tendency,  respectively,  and  N *  and  N  are  the  sets  of  the  true  and  the  estimated 
top  15%  nodes  of  low  anti-majoritarian  tendency,  respectively. 

We  compared  the  proposed  method  with  the  naive  approach  in  which  the  anti-majoritarian 
tendency  of  a  node  is  estimated  by  simply  counting  the  number  of  opinion  updates  in  which 
the  opinion  chosen  by  the  node  is  the  minority’s  opinion  in  its  neighborhood.  We  refer  to 
the  method  as  the  naive  counting  method.  We  also  compared  the  proposed  method  with  the 
uniform  value  method  mentioned  in  the  previous  subsection. 

Figures  9  and  10  are  the  results  for  the  Blog  network,  where  circles,  upward  triangles, 
and  squares  indicate  the  prediction  errors  and  the  F-measure  performance  of  the  proposed 
method,  the  uniform  value  method,  and  naive  method ,  respectively.  Figures  9  (a),  and  (b) 
show  the  estimation  error  &a  of  each  method  as  a  function  of  time  span  Tq  with  K  =  3 
and  K  =  10,  respectively,  while  Figures  10  (a)  and  (b)  the  F-measure  Ta  of  each  method 
as  a  function  of  time  span  Tq  with  K  =  3  and  K  =  10,  respectively.  Here,  we  repeated  the 
same  experiment  10  times  independently,  and  plotted  the  average  over  the  10  results.  Fig¬ 
ures  11,  12,  13,  14,  15  and  16  are  the  results  for  the  other  three  networks,  i.e.,  the  Coauthor 
network,  the  Enron  network,  and  the  Wikipedia  network,  respectively.  Note  that  we  only 
showed  the  results  for  a  =  0.5,  i.e.,  a  =  b  =  2,  because  we  obtained  quite  similar  results  for 
the  other  anti-majoritarian  tendency  a. 
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(a)  Results  for  K  =  3. 


(b)  Results  for  K  =  10. 


Fig.  9:  Estimation  errors  of  anti-majoritarian  tendency  for  the  Blog  network. 


time  span  time  span 

(a)  Results  for  K  =  3.  (b)  Results  for  K  =  10. 

Fig.  10:  Accuracies  of  extracting  nodes  with  high  anti-majoritarian  tendency  for  the  Blog  network. 


Table  5:  Results  for  estimation  errors  of  anti-majoritarian  tendency  for  the  Blog  network  (To  =  1000).  Note 
that  the  two-side  0.05  point  of  the  /-distribution  with  9  degrees  of  freedom  is  tg  0  „5  =  2.262. 


Method 

K  =  3 

Average  of  error  Sa 

K  =  3 

/-value  7jfc 

K=  10 

Average  of  error  Sa 

K=  10 
/-value  7jfc 

proposed 

0.0229 

— 

0.0169 

— 

uniform  value 

0.0280 

5.6016 

0.0186 

5.0033 

naive 

0.1403 

229.4537 

0.1607 

577.3649 

In  order  to  investigate  whether  the  difference  between  the  proposed  method  and  each  of 
the  other  methods  is  statistically  significant  or  not,  we  in  particular  performed  a  f-test  for 
estimation  error  Sa.  Let  and  £„  denote  the  values  of  Sa  for  the  proposed  method  and  a 
compared  method,  respectively.  We  calculated  t-  value 

pc  _  VlO  mean (££-££) 
std  (££-££) 

where  mean(.r)  and  std(x)  are  defined  in  the  previous  section.  Tables  5,  6,  7,  and  8  show  the 
results  for  estimation  errors  of  anti-majoritarian  tendency  in  the  case  of  Tq  =  1000  for  the 
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(a)  Results  for  K  =  3. 


(b)  Results  for  K  =  10. 


Fig.  1 1:  Estimation  errors  of  anti-majoritarian  tendency  for  the  Coauthor  network. 


Fig.  12:  Accuracies  of  extracting  nodes  with  high  anti-majoritarian  tendency  for  the  Coauthor  network. 


Table  6:  Results  for  estimation  errors  of  anti-majoritarian  tendency  for  the  Coauthor  network  (7o  =  1000). 
Note  that  the  two-side  0.05  point  of  the  /-distribution  with  9  degrees  of  freedom  is  Cogs  =  2.262. 


Method 

K  =  3 

Average  of  error  Sa 

K  =  3 

/-value  Tjfc 

K=  10 

Average  of  error  Sa 

K=  10 
/-value  Tjfc 

proposed 

0.0195 

— 

0.0147 

— 

uniform  value 

0.0208 

4.0840 

0.0150 

9.9920 

naive 

0.1350 

404.8052 

0.1074 

526.8500 

Blog,  the  Coauthor,  the  Enron,  and  the  Wikipedia  networks,  respectively.  Here,  the  second 
and  the  fourth  columns  indicate  the  average  of  Sa  over  the  10  trials  for  the  cases  of  K  =  3  and 
K  =  10,  respectively.  Also,  the  third  and  the  fifth  columns  indicate  /-value  7_(fc  for  the  cases 
of  K  =  3  and  K  =  10,  respectively.  Note  that  the  two-side  0.05  point  of  the  /-distribution 
with  9  degrees  of  freedom  is  /*  Q  Q5  =  2.262.  Thus,  from  Tables  5,  6,  7,  and  8,  we  see  that  in 
the  case  of  Tq  =  1000,  the  difference  between  the  proposed  and  each  comparison  methods 
in  prediction  error  Sa  is  statistically  significant  by  the  /-test  at  significance  level  0.05.  Note 
that  we  only  showed  the  results  for  Tq  =  1000,  because  we  obtained  quite  similar  results  for 
other  values  of  Tq  >  100.  As  explained  in  Subsection  6.2,  Tq  =  10  is  too  short  for  learning 
anti-majoritarians. 
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time  span  time  span 


(a)  Results  for  K  =  3.  (b)  Results  for  K  =  10. 

Fig.  13:  Estimation  errors  of  anti-majoritarian  tendency  for  the  Enron  network. 


time  span  time  span 


(a)  Results  for  K  =  3. 


(b)  Results  for  K  =  10. 


Fig.  14:  Accuracies  of  extracting  nodes  with  high  anti-majoritarian  tendency  for  the  Enron  network. 


Table  7:  Results  for  estimation  errors  of  anti-majoritarian  tendency  for  the  Enron  network  (7o  =  1000).  Note 
that  the  two-side  0.05  point  of  the  /-distribution  with  9  degrees  of  freedom  is  /*  0  Qs  =  2.262. 


Method 

K  =  3 

Average  of  error  Sa 

K  =  3 

/-value  Tjfc 

K=  10 

Average  of  error  Sa 

K=  10 
/-value  t£c 

proposed 

0.0254 

— 

0.0186 

— 

uniform  value 

0.0331 

3.8280 

0.0220 

8.5671 

naive 

0.1453 

101.3125 

0.1863 

306.6563 

As  expected,  &a  decreases,  and  Ta  increases  as  To  increases  ( i.e .,  the  amount  of  training 
data  DTo  increases).  We  observe  that  the  proposed  method  performs  the  best,  the  uniform 
value  method  follows,  and  the  naive  method  behaves  very  poorly  for  all  the  networks.  Here, 
we  note  that  quite  similar  results  were  also  observed  for  Tn,  i.e.,  extracting  nodes  with  low 
anti-majoritarian  tendency  although  those  results  are  not  reported  in  this  paper.  The  proposed 
method  can  detect  both  the  anti-majoritarians  and  the  majoritarians  with  the  accuracy  greater 
than  90%  at  T  =  1000  for  all  cases.  We  can  also  see  that  the  proposed  method  is  not  sensitive 
to  both  K  and  the  network  structure  because  of  the  explicit  use  of  the  model,  but  the  other  two 
methods  are  so.  For  example,  although  the  uniform  value  method  of  K  =  10  performs  well 
in  Ta  for  the  Blog,  Coauthor  and  Enron  networks,  it  does  not  so  in  Ta  for  the  Wikipedia 
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(a)  Results  for  K  =  3. 


(b)  Results  for  K  =  10. 


Fig.  15:  Estimation  errors  of  anti-majoritarian  tendency  for  the  Wikipedia  network. 


time  span  time  span 


(a)  Results  for  K  =  3.  (b)  Results  for  K  =  10. 

Fig.  16:  Accuracies  of  extracting  nodes  with  high  anti-majoritarian  tendency  for  the  Wikipedia  network. 


Table  8:  Results  for  estimation  errors  of  anti-majoritarian  tendency  for  the  Wikipedia  network  (To  =  1000). 
Note  that  the  two-side  0.05  point  of  the  /-distribution  with  9  degrees  of  freedom  is  /„  „  Q5  =  2.262. 


Method 

K  =  3 

Average  of  error  Sa 

K  =  3 

/-value  7jfc 

K=  10 

Average  of  error  Sa 

K=  10 
/-value  7jfc 

proposed 

0.0336 

— 

0.0224 

— 

uniform  value 

0.0489 

3.2202 

0.0360 

9.0308 

naive 

0.1550 

51.3607 

0.2409 

392.8008 

network.  These  results  clearly  demonstrate  the  advantage  of  the  proposed  method,  and  it 
does  not  seem  feasible  to  detect  even  roughly  the  high  anti-majoritarian  tendency  nodes 
without  using  the  explicit  model  and  solving  the  optimization  problem. 

Here,  we  also  note  that  the  proposed  method  accurately  estimated  the  opinion  values.  In 
fact,  the  average  estimation  errors  of  opinion  value  were  less  than  1%  at  Tq  =  1000  for  all 
cases.  Moreover,  we  note  that  the  processing  times  of  the  proposed  method  at  Tq  =  1000  for 
K  =  3  and  K  =  10  were  less  than  3  min.  and  4  min.,  respectively.  All  our  experiments  were 
executed  on  a  single  PC  with  an  Intel  Core  2  Duo  3GHz  processor,  with  2GB  of  memory, 
running  under  Linux. 
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7  Conclusion 

Unlike  the  popular  probabilistic  model  such  as  Independent  Cascade  and  Linear  Threshold 
models  for  information  diffusion  where  the  node  in  the  network  takes  only  one  of  the  two 
states  (active  or  inactive),  applications  such  as  on-line  competitive  service  in  which  a  user 
can  choose  one  from  multiple  choices  and  opinion  formation  in  which  a  person  listens  to 
his/her  neighbors”  different  opinions  and  decides  whether  to  change  his/her  opinion  require 
a  model  that  can  handle  multiple  states. 

We  extended  a  voter  model,  a  model  of  opinion  formation  dynamics  where  the  basic 
assumption  adopted  is  that  people  change  their  opinions  following  their  neighbors’  majority 
opinion,  and  proposed  a  new  opinion  formation  model  called  Value-weighted  Mixture  Voter 
(VwMV)  Model  to  analyze  how  the  multiple  opinions  spread  over  a  large  social  network  and 
predict  future  opinion  share.  The  model  has  two  new  features.  One  is  that  each  opinion  can 
have  a  value,  a  measure  of  opinion’s  importance,  and  the  other  is  that  each  node  can  have  an 
anti-majoritarian  tendency,  a  measure  of  deviation  from  the  ordinary  behavior.  In  particular, 
the  latter  reflects  the  fact  that  there  are  always  people  who  do  not  agree  with  the  majority 
and  support  the  minority  opinion.  Both  are  parameters  in  the  model,  and  their  values  are  not 
known  in  general. 

Our  goal  was  to  1)  learn  the  parameters  from  a  limited  amount  of  observed  opinion  prop¬ 
agation  data  and  predict  the  opinion  share  in  the  near  future,  2)  identify  the  anti-majoritarians 
from  the  learned  results,  and  3)  analyze  asymptotic  behavior  of  average  opinion  dynamics 
to  uncover  its  intrinsic  characteristics. 

For  the  first  and  the  second  goals  we  showed  that  these  parameters  are  learnable  from 
a  sequence  of  observed  opinion  data  by  iteratively  maximizing  the  likelihood  function.  We 
further  showed  that  it  is  enough  to  learn  the  opinion  values  and  the  average  anti-majoritarian 
tendency  in  good  accuracy  if  the  target  is  to  predict  the  future  opinion  share,  which  can  be 
done  easily  using  a  limited  amount  of  observed  data,  but  identifying  the  anti-majoritarians  in 
good  accuracy  requires  much  longer  observation  data  because  the  anti-majoritarian  tendency 
of  each  node  has  to  be  learned.  The  learning  algorithm  is  guaranteed  to  find  the  global 
optimal  solution  when  there  are  no  anti-majoritarians  but  may  be  trapped  to  a  local  optimal 
solution  when  there  are  anti-majoritarians.  However,  the  numerical  experiment  shows  that 
the  algorithm  converges  to  a  global  optimal  if  there  is  enough  amount  of  data.  We  emphasize 
that  use  of  the  learned  model  can  predict  the  future  opinion  share  much  more  accurately  than 
a  simple  polynomial  extrapolation  can  do,  and  a  model  ignoring  these  parameters  (opinion 
values  and  the  anti-majoritarian  tendencies)  substantially  degrades  the  performance  of  share 
prediction.  We  tried  to  find  a  simpler  way  to  estimate  the  anti-majoritarian  tendency  of  each 
node,  but  there  seems  to  be  no  way.  The  heuristic  that  simply  counts  the  number  of  opinion 
updates  in  which  the  chosen  opinion  is  the  same  as  the  minority  opinion  gives  only  a  very 
poor  approximation.  Thus,  it  is  important  to  explicitly  model  the  anti-majoritarian  tendency 
to  predict  the  correct  future  opinion  share.  For  the  third  goal  we  applied  the  mean  field 
theory  and  uncovered  the  following  features.  In  a  situation  where  the  local  opinion  share  can 
be  approximated  by  the  average  opinion  share,  1 )  when  there  are  no  anti-majoritarians,  the 
opinion  with  the  highest  value  eventually  takes  over,  but  2)  when  there  is  a  certain  fraction 
of  anti-majoritarians,  it  is  not  necessarily  the  case  that  the  opinion  with  the  highest  value 
prevails  and  wins,  and  further,  3)  in  both  cases,  when  the  opinion  values  are  uniform,  the 
opinion  share  prediction  problem  becomes  ill-defined  and  any  opinion  can  win.  Although 
the  mean  field  approximation  does  not  hold  in  real  networks,  the  simulation  that  uses  the 
real  world  network  structure  supports  that  this  holds  for  real  world  social  networks  that  we 
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used  in  this  study.  We  believe  that  these  findings  are  useful  in  deepening  our  understanding 
the  behavior  of  opinion  dynamics. 
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We  addressed  the  problem  of  detecting  the  change  in  behavior  of  information  diffusion  over  a  social  network 
which  is  caused  by  an  unknown  external  situation  change  using  a  small  amount  of  observation  data  in  a 
retrospective  setting.  The  unknown  change  is  assumed  to  be  effectively  reflected  in  changes  in  the  parame¬ 
ter  values  in  the  probabilistic  information  diffusion  model,  and  the  problem  is  reduced  to  detecting  where  in 
time  and  how  long  this  change  persisted  and  how  big  this  change  is.  We  solved  this  problem  by  searching  the 
change  pattern  that  maximizes  the  likelihood  of  generating  the  observed  information  diffusion  sequences, 
and  in  doing  so  we  devised  a  very  efficient  general  iterative  search  algorithm  using  the  derivative  of  the 
likelihood  which  avoids  parameter  value  optimization  during  each  search  step.  This  is  in  contrast  to  the 
naive  learning  algorithm  in  that  it  has  to  iteratively  update  the  patten  boundaries,  each  requiring  the  pa¬ 
rameter  value  optimization  and  thus  is  very  inefficient.  We  tested  this  algorithm  for  two  instances  of  the 
probabilistic  information  diffusion  model  which  has  different  characteristics.  One  is  of  information  push 
style  and  the  other  is  of  information  pull  style.  We  chose  asynchronous  independent  cascade  (AsIC)  model 
as  the  former  and  value-weighted  voter  (VwV)  model  as  the  latter.  The  AsIC  is  the  model  for  general  in¬ 
formation  diffusion  with  binary  states  and  the  parameter  to  detect  its  change  is  diffusion  probability  and 
the  VwV  is  the  model  for  opinion  formation  with  multiple  states  and  the  parameter  to  detect  its  change  is 
opinion  value.  The  results  tested  on  these  two  models  using  four  real  world  network  structures  confirmed 
that  the  algorithm  is  robust  enough  and  can  efficiently  identify  the  correct  change  pattern  of  the  param¬ 
eter  values.  Comparison  with  the  naive  method  that  finds  the  best  combination  of  change  boundaries  by 
an  exhaustive  search  through  a  set  of  randomly  selected  boundary  candidates  showed  that  the  proposed 
algorithm  far  outperforms  the  native  method  both  in  terms  of  accuracy  and  computation  time. 
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1.  INTRODUCTION 

Recent  technological  innovation  in  the  web  such  as  blogosphere  and  knowledge/media¬ 
sharing  sites  is  remarkable,  which  has  made  it  possible  to  form  various  kinds  of  large 
social  networks,  through  which  behaviors,  ideas,  rumors  and  opinions  can  spread,  and 
our  behavioral  patterns  are  to  a  considerable  degree  affected  by  the  interaction  with 
these  networks  and  substantial  attention  has  been  directed  to  investigating  the  spread 
of  information  in  these  networks  [Newman  et  al.  2002;  Newman  2003;  Gruhl  et  al. 
2004;  Domingos  2005;  Leskovec  et  al.  2006;  Crandall  et  al.  2008;  Wu  and  Huberman 
2008]. 

These  studies  have  shown  that  it  is  important  to  consider  the  diffusion  mechanism 
explicitly  and  the  measures  based  on  network  structure  alone,  i.e.,  various  centrality 
measure,  are  not  enough  to  identify  the  important  nodes  [Kimura  2009;  2010a].  In¬ 
formation  diffusion  is  modeled  typically  by  probabilistic  models.  Most  representative 
and  fundamental  ones  for  general  information  diffusion  are  independent  cascade  (IC) 
model  [Goldenberg  et  al.  2001;  Kempe  et  al.  2003],  linear  threshold  (LT)  model  [Watts 
2002;  Watts  and  Dodds  2007]  and  their  extensions  that  include  incorporating  asyn¬ 
chronous  time  delay  [Saito  et  al.  2009b;  2010a].  The  IC  model  is  a  model  of  infor¬ 
mation  push  style,  i.e.,  the  information  sender  (a  node)  tries  to  push  the  information 
to  the  neighboring  receivers  (child  nodes)  in  a  probabilistic  way.  The  LT  model  is  a 
model  of  information  pull  style,  i.e.,  the  information  receiver  (a  node)  tries  to  pull 
the  information  from  the  neighboring  senders  (parents  nodes)  in  a  probabilistic  way. 
Since  the  focus  of  study  is  “influence”,  these  models  assume  binary  states,  i.e.,  nodes 
are  either  active  (influenced)  or  inactive  (uninfluenced).  Explicit  use  of  these  models 
to  solve  such  problems  as  the  influence  maximization  problem  [Kempe  et  al.  2003; 
Kimura  et  al.  2010a;  Chen  et  al.  2010a;  2010b]  and  the  contamination  minimization 
problem  [Kimura  et  al.  2009]  clearly  shows  the  advantage  of  the  model.  The  identi¬ 
fied  influential  nodes  and  links  are  considerably  different  from  the  ones  identified  by 
the  centrality  measures.  Another  type  of  information  diffusion  model  that  is  also  of¬ 
ten  used  is  voter  model  [Even-Dar  and  Shapria  2007]  and  its  extensions  that  include 
incorporating  opinion  values  [Kimura  et  al.  2010b],  node  strength  [Yamagishi  et  al. 
2011]  and  anti-majoritarian  tendency  [Kimura  et  al.  2011].  The  voter  model  is  a  model 
of  information  pull  style  and  is  used  to  study  the  spread  of  opinions,  i.e.,  opinion  for¬ 
mation.  It  is  similar  to  the  LT  model  in  that  the  opinion  of  a  person  is  affected  by  the 
opinions  of  his/her  neighbors.  What  is  different  from  the  LT  model  is  that  it  has  to  have 
multiple  states  if  it  has  to  deal  with  multiple  opinions1.  This  notion  is  not  necessarily 
limited  to  opinion.  Application  such  as  an  on-line  competitive  service  in  which  a  user 
can  choose  one  from  multiple  choices/decisions  requires  a  model  that  handles  multiple 
states.  There  has  been  a  variety  of  work  on  the  voter  model,  too.  Dynamical  properties 
of  the  basic  model  have  been  extensively  studied  including  how  the  degree  distribution 
and  the  network  size  affect  the  mean  time  to  reach  consensus  from  mathematical  point 
of  view  [Liggett  1999;  Sood  and  Redner  2005].  Several  variants  of  the  voter  model  are 
also  investigated  and  non  equilibrium  phase  transition  is  analyzed  [Castellano  et  al. 
2009;  Yang  et  al.  2009]  from  physics  point  of  view.  Yet  another  line  of  work  extends 
the  voter  model  by  combining  it  with  a  network  evolution  model  [Holme  and  Newman 
2006;  Crandall  et  al.  2008].  Kimura  et  al.  [2010b]  analyzed  how  the  opinion  values 
affect  the  opinion  share  dynamics  in  their  recent  study. 

What  is  common  to  all  the  above  models  is  that  they  are  all  probabilistic  models 
and  have  parameters  to  characterize  the  information  diffusion.  The  parameters  must 
be  known  in  advance  for  the  model  to  be  usable  for  analysis.  It  is  generally  difficult 


lrThe  basic  voter  model  has  only  two  opinions  but  it  is  straightforward  to  extend  it  to  handle  multiple 
opinions. 
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to  determine  the  values  of  these  parameters  theoretically,  and  thus,  attempts  have 
been  made  to  learn  these  parameter  values  by  observing  the  information  diffusion 
sequence  data  [Saito  et  al.  2009a;  2009b;  2010a;  2010b;  Gomez-Rodriguez  et  al.  2010; 
Myers  and  Leskovec  2010;  Kimura  et  al.  2010b].  In  essence  the  likelihood  of  generating 
the  observed  data  by  the  model  employed  is  first  derived,  and  then  the  parameter 
values  are  determined  such  that  the  likelihood  is  maximized.  In  particular,  Myers 
and  Leskovec  [2010]  showed  that  for  a  certain  class  of  diffusion  models,  the  problem 
can  effectively  be  transformed  to  a  convex  programming  for  which  a  global  solution  is 
guaranteed.  Another  important  common  assumption  made  in  these  studies  is  that  the 
model  is  stationary.  Since  the  model  is  probabilistic,  even  if  the  model  is  stationary,  the 
way  information  propagates  from  a  particular  node  is  not  the  same  (not  deterministic) 
and  each  time  the  diffusion  result  is  different.  However,  the  model  parameter  values 
remain  the  same  during  the  whole  course  of  analysis. 

This  paper  addresses  a  different  aspect  of  information  diffusion,  and  extends  and 
integrates  our  recent  studies  [Saito  et  al.  2011a;  Ohara  et  al.  2011].  We  note  that  our 
behavior  is  affected  not  only  by  the  behavior  of  our  neighbors  but  also  by  other  external 
factors.  The  model  only  accounts  for  the  interaction  with  neighbors.  The  behavior  we 
observe  includes  both  effects.  The  problem  we  address  here  is  to  detect  the  change 
in  the  model  from  a  limited  amount  of  observed  information  diffusion  data.  If  this  is 
possible,  this  would  bring  a  substantial  advantage.  For  example,  we  can  infer  that 
something  unusual  happened  during  a  particular  period  of  time  by  simply  analyzing 
the  limited  amount  of  data. 

This  is  in  some  sense  the  same,  in  the  spirit,  with  the  work  by  Kleinberg  [2002] 
and  Swan  and  Allan  [2000].  They  noted  a  huge  volume  of  the  data  stream,  tried  to 
organize  it  and  extract  structures  behind  it.  This  is  done  in  a  retrospective  framework, 
i.e.,  assuming  that  there  is  a  flood  of  abundant  data  already  and  there  is  a  strong 
need  to  understand  it.  Kleinberg’s  work  is  motivated  by  the  fact  that  the  appearance 
of  a  topic  in  a  document  stream  is  signaled  by  a  “burst  of  activity”  and  identifying 
its  nested  structure  manifests  itself  as  summarization  of  the  activities  over  a  period 
of  time,  making  it  possible  to  analyze  the  underlying  content  much  easier.  He  used  a 
hidden  Markov  model  in  which  bursts  appear  naturally  as  state  transitions,  and  suc¬ 
cessfully  identified  the  hierarchical  structure  of  e-mail  messages.  Swan  and  Allan’s 
work  is  motivated  by  the  need  to  organize  huge  amount  of  information  in  an  efficient 
way.  They  used  a  statistical  model  of  feature  occurrence  over  time  based  on  hypotheses 
testing  and  successfully  generated  clusters  of  named  entities  and  noun  phrases  that 
capture  the  information  corresponding  to  major  topics  in  the  corpus,  and  designed  a 
way  to  nicely  display  the  summary  on  the  screen  (Overview  Timelines).  Our  aim  is 
not  exactly  the  same  as  theirs.  We  are  interested  in  detecting  changes  in  the  external 
factors  which  are  hidden/embedded  in  the  data.  We  also  follow  the  same  retrospective 
approach,  i.e.,  we  are  not  predicting  the  future,  but  we  are  trying  to  understand  the 
phenomena  that  happened  in  the  past.  There  are  many  factors  that  bring  in  changes 
and  evidently  the  model  cannot  accommodate  all  of  them.  We  formalize  this  as  the 
unknown  changes  in  the  parameter  value  of  the  diffusion  model  we  employ,  and  we  re¬ 
duce  the  problem  to  that  of  detecting  where  in  time  and  how  long  this  change  persisted 
and  how  big  this  change  is.  We  call  the  period  where  the  parameter  takes  anomalous 
values  as  “hot  span”  and  the  rest  as  “normal  span”. 

We  have  chosen  the  asynchronous  independent  cascade  (AsIC)  model  [Saito  et  al. 
2009b;  2010a]  as  the  one  that  represents  the  model  of  information  push  style,  and  the 
value-weighted  voter  (VwV)  model  [Kimura  et  al.  2010b]  as  the  one  that  represents  the 
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model  of  information  pull  style2.  As  explained  above,  the  AsIC  is  the  model  for  general 
information  diffusion  with  binary  states  and  the  parameter  to  detect  its  change  is 
diffusion  probability  and  the  VwV  is  the  model  for  opinion  formation  with  multiple 
states  and  the  parameter  to  detect  its  change  is  opinion  value.  These  two  models  are 
recalled  in  Section  2.  We  generalized  the  parameter  optimization  algorithm  that  was 
first  introduced  in  [Saito  et  al.  2011a;  Ohara  et  al.  2011]  so  that  it  can  cover  both  the 
models  as  two  different  instances  and  expanded  the  experiments  to  verify  that  the 
same  algorithm  works  satisfactorily  for  two  different  types  of  information  diffusion 
models.  As  in  our  previous  work,  we  limit  the  form  of  change  to  a  rect-linear  one,  that 
is,  the  parameter  value  changes  to  a  new  large  value,  persists  for  a  certain  period 
of  time  and  is  restored  to  the  original  value  and  stays  the  same  thereafter  3.  In  this 
simplified  setting,  detecting  the  hot  span  is  equivalent  to  identifying  the  time  window 
where  the  parameter  value  is  anomalous  and  estimating  the  parameter  values  both  in 
the  hot  and  the  normal  spans. 

We  use  the  same  parameter  optimization  algorithm  as  in  [Saito  et  al.  2009b;  Kimura 
et  al.  2010b],  i.e.,  the  EM-like  algorithm  for  the  AsIC  model  that  iteratively  updates  the 
values  to  maximize  the  model’s  likelihood  of  generating  the  observed  data  sequences, 
and  the  Newton  method  for  the  VwV  model  that  guarantees  globally  maximizing  the 
likelihood.  However,  the  problem  here  is  more  difficult  because  it  has  another  loop  to 
search  for  the  hot  span  on  top  of  the  above  loop.  The  naive  learning  algorithm  has 
to  iteratively  update  the  patten  boundaries  (outer  loop)  and  the  value  must  also  be 
optimized  for  each  combination  of  the  pattern  boundaries  (inner  loop),  which  is  ex¬ 
traordinary  inefficient.  Our  main  contribution  is  that  we  devised  a  very  efficient  gen¬ 
eral  search  algorithm  which  works  for  probabilistic  information  diffusion  models  and 
avoids  the  inner  loop  optimization  by  using  the  information  of  the  first  order  deriva¬ 
tive  of  the  likelihood  with  respect  to  the  parameters.  We  tested  its  performance  using 
the  structures  of  four  real  world  networks  (Blog,  Coauthorship,  Enron  and  Wikipedia), 
and  confirmed  that  the  algorithm  can  efficiently  identify  the  hot  span  correctly  as  well 
as  the  parameter  values  of  both  the  normal  and  the  hot  spans.  We  further  compared 
our  algorithm  with  the  naive  method  that  finds  the  best  combination  of  the  hot  span 
boundaries  by  an  exhaustive  search  from  a  set  of  randomly  selected  boundary  candi¬ 
dates,  and  showed  that  the  proposed  algorithm  far  outperforms  the  naive  method  both 
in  terms  of  accuracy  and  computation  time. 

The  paper  is  organized  as  follows.  After  very  briefly  introducing  the  two  diffusion 
models,  AsIC  and  VwV  in  Section  2,  we  define  the  problem  in  Section  3  and  recall  how 
the  parameters  can  be  learned  in  each  model  in  Section  4.  The  main  part  is  Section  5 
where  we  explain  how  we  efficiently  search  for  the  hot  span  as  well  as  the  parameter 
values.  The  results  are  explained  in  Section  6,  followed  by  discussion  in  Section  7.  We 
end  this  paper  by  summarizing  the  main  result  in  Section  8. 

2.  INFORMATION  DIFFUSION  MODELS 

We  focus  on  two  types  of  information  diffusion  model  on  a  social  network  G  =  ( V. ,  E ), 
where  V  and  E  (c  V  x  V)  are  the  sets  of  all  the  nodes  and  the  links,  respectively. 
One  is  the  asynchronous  independent  cascade  (AsIC)  model  that  is  an  extension  of 
the  independent  cascade  (IC)  model,  and  the  other  is  the  value-weighted  voter  (VwV) 
model  that  is  an  extension  of  the  standard  voter  model.  They  were  extended  to  meet 
more  realistic  situations.  We  recall  their  definitions  below. 


2  We  could  have  chosen  AsLT  instead  of  VwV.  There  is  no  specific  reason  that  we  cannot  handle  AsLT.  Our 
aim  is  to  show  that  our  approach  is  general  enough  and  applicable  to  a  wide  variety  of  diffusion  models. 
3We  discuss  that  the  basic  algorithm  can  be  extended  to  more  general  change  patterns  in  Section  7,  and 
show  that  it  works  for  two  distinct  rect-linear  patterns  in  case  of  AsIC. 
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2.1.  Asynchronous  Independent  Cascade  (AsIC)  Model 

The  AsIC  model  we  use  in  this  paper  incorporates  asynchronous  time  delay  into  the  IC 
model  which  does  not  account  for  time-delay,  noting  that  each  node  changes  its  state 
asynchronously  in  reality  [Saito  et  al.  2009b;  2010a].  Here,  we  consider  choosing  a 
delay-time  from  the  exponential  distribution  for  the  sake  of  convenience,  but  of  course 
other  distributions  such  as  power-law  and  Weibull  can  be  employed. 

For  the  AsIC  model,  the  underlying  network  G  =  ( V,  E )  is  a  directed  graph.  For  any 
v  £  V,  the  set  of  all  the  nodes  that  have  links  from  v  (child  nodes)  is  denoted  by 

F(v)  ={«£f;  (v,u)  G  E}, 

and  the  set  of  all  the  nodes  that  have  links  to  v  (parent  nodes)  is  denoted  by 

B[v)  =  {u  G  V;  (u,v)  G  E}. 

Each  node  has  one  of  the  two  states  (active  and  inactive),  and  the  nodes  are  called 
active  if  they  have  been  influenced.  It  is  assumed  that  nodes  can  switch  their  states 
only  from  inactive  to  active. 

The  AsIC  model  has  two  types  of  parameters  pu<v  and  ru.v  with  0  <  pu<v  <  1  and  ru>v 
>  0,  where  pu.v  and  ru.v  are  referred  to  as  the  diffusion  probability  through  link  (u.  v) 
and  the  time-delay  parameter  through  link  (u,  v),  respectively.  We  define  the  diffusion- 
probability  vector  p  and  the  time-delay  parameter  vector  r  by 

P  =  {Pu,  t))(Ui  „)£E>  r  =  (rU’v)(UtV')eE- 

The  information  diffusion  process  unfolds  in  continuous -time  t,  and  proceeds  from  a 
given  initial  active  node  in  the  following  way.  When  a  node  u  becomes  active  at  time  t, 
it  is  given  a  single  chance  to  activate  each  currently  inactive  node  v  G  E(u).  A  delay¬ 
time  S  is  chosen  from  the  exponential  distribution  with  parameter  r,,  v.  The  node  u 
attempts  to  activate  the  node  v  if  v  has  not  been  activated  by  time  t  +  6,  and  succeeds 
with  probability  pu<v.  If  u  succeeds,  v  will  become  active  at  time  t  +  6.  The  information 
diffusion  process  terminates  if  no  more  activations  are  possible. 

2.2.  Value  weighted  Voter  (VwV)  Model 

The  mathematical  model  we  use  for  the  diffusion  of  opinions  is  the  VwV  model  with 
K  (>  2)  opinions  [Kimura  et  al.  2010b].  For  the  VwV  model,  the  underlying  network 
G  =  (V,  E)  is  an  undirected  (bidirectional)  graph  with  self-loops.  For  a  node  v  G  V,  let 
r(r)  denote  the  set  of  neighbors  of  v  in  G,  that  is, 

r(r)  =  {u  G  V;  (u,v)G  E}. 

Note  that  v  G  r(u)  because  of  the  existence  of  self-loops. 

In  the  VwV  model,  each  node  of  G  is  endowed  with  ( K  +  1)  states;  opinions  1,  •  •  •, 
K,  and  neutral  (i.e.,  no-opinion  state).  It  is  assumed  that  a  node  never  switches  its 
state  from  any  opinion  k  back  to  neutral.  The  model  has  a  parameter  up;  (>  0)  for 
each  opinion  k,  which  is  called  the  opinion  value  and  must  be  estimated  from  observed 
opinion  diffusion  data.  We  define  the  opinion-value  vector  w  by 

w  =  (wi  ,■■■  ,wK)- 

Let  ft.  :  V  — >  {0, 1, 2,  •  •  • ,  K}  denote  the  opinion  distribution  at  time  t,  where  ft[v) 
stands  for  the  opinion  of  node  v  at  time  t,  and  opinion  0  denotes  the  neutral  state.  We 
also  denote  by  nk(t ,  v)  the  number  of  v’s  neighbors  that  hold  opinion  k  as  the  latest  one 
before  time  t  for  k  =  1, 2,  •  •  • ,  K,  i.e. , 

Uk(t,v)  =  |{it  e  r(r);  (j>t{u)  =  fc}|, 

where  <j>t  (u)  is  the  latest  opinion  of  u  before  time  t. 
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Given  a  target  time  T,  and  an  initial  state  in  which  each  opinion  is  assigned  to  only 
one  distinct  node  and  all  other  nodes  are  in  the  neutral  state,  the  evolution  process  of 
the  model  unfolds  in  the  following  way.  At  time  0,  each  node  v  independently  decides 
its  update  time  t  according  to  some  probability  distribution  such  as  an  exponential 
distribution  with  parameter  rv,  where  ry  becomes  also  a  model  parameter  and  then  we 
define  the  time-delay  parameter  vector  r  by  r  =  (rv)vpV-  The  successive  update  time 
is  determined  similarly  at  each  update  time  t.  Node  v  changes  its  opinion  at  its  update 
time  t  as  follows:  If  node  v  has  at  least  one  neighbor  with  some  opinion  before  time 
t,  ft.(v)  =  k  with  probability  WkUkit,  v )  /  Y^k'=i  Wk’  nk'  (L  v )  f°r  &  =  1,  ■  •  • ,  K,  otherwise, 
ft(v)  =  0  with  probability  1.  It  is  noted  that  since  node  v  is  included  in  its  neighbors 
by  definition,  its  own  opinion  is  also  reflected.  The  process  is  repeated  from  the  initial 
time  t  =  0  until  the  next  update-time  attains  a  given  final-time  T. 


3.  PROBLEM  DEFINITION 

We  address  the  hot  span  detection  problem.  In  this  problem,  we  assume  that  some 
change  has  happened  in  the  way  the  information  diffuses,  and  we  observe  the  diffusion 
sequences  of  a  certain  topic  in  which  the  change  is  embedded,  and  consider  detecting 
where  in  time  and  how  long  this  change  persisted  and  how  big  this  change  is.  In  the 
following  subsections,  we  describe  a  specific  detection  problem  by  focusing  on  the  above 
diffusion  models,  i.e.,  the  AsIC  model  and  the  VwV  model. 


3.1.  AsIC  Model 

An  information  diffusion  result  generated  by  the  AsIC  model  is  represented  as  a  set  of 
pairs  of  active  nodes  and  their  activation  times;  i.e.,  {( u ,  tu),  ( v ,  tv),  ■  ■  ■}.  We  consider  a 
diffusion  result  'D ( 0 ,  T),  where  the  initial  activation  time  is  set  to  0  and  the  final  obser¬ 
vation  time  is  denoted  by  T.  Since  we  employ  only  a  single  diffusion  result  V({).  T),  we 
place  a  constraint  that  pu,v  and  do  not  depend  on  link  ( u ,  v),  i.e.,  pU:V  =  p,  rUiV  =  r 
(\/(u,v)  G  E),  which  should  be  acceptable  noting  that  we  can  naturally  assume  that 
people  behave  quite  similarly  when  talking  about  the  same  topic  (see  Section  7). 

Let  [Ti ,  T2 )  denote  the  hot  span  of  the  information  diffusion,  and  let  p„  and  pn  denote 
the  diffusion  probability  for  the  normal  span  and  the  hot  span,  respectively.  Namely, 
the  diffusion  probability  p  is  obtained  by  p  =  pn  for  the  period  [0,Ti),  p  =  ph  for  the 
period  [Ti,T2),  and  p  =  pn  for  the  period  \T2,T).  Here  we  assume  for  simplicity  that 
the  time-delay  parameter  r  does  not  change  and  takes  the  same  value  for  the  entire 
period  [0,  T).  Then,  the  hot  span  detection  problem  is  reduced  to  detecting  the  hot  span 
[Ti,  T2)  and  estimating  pn  and  ph  from  the  observed  diffusion  result  V(0,  T ). 

Figure  1  shows  five  examples  of  diffusion  sample  with  (Fig.  lb)  and  without  (Fig.  la) 
a  hot  span  based  on  the  AsIC  model,  where  the  parameters  are  set  at  pn  =  0.1,  ph  =  0.3, 
r  =  1.0,  Ti  =  10,  T2  =  20.  The  network  used  is  the  blog  network  described  later  in 
Subsection  6.1.  We  plotted  the  ratio  of  active  nodes  (the  number  of  nodes  activated 
at  time  step  t  divided  by  the  number  of  total  active  nodes  over  the  whole  time  span) 
for  five  independent  simulations,  each  from  a  randomly  chosen  initial  source  node  at 
time  t  =  0.  Comparing  these  two  figures,  we  can  clearly  see  bursty  activities  around 
the  hot  span  [10, 20)  in  Fig.  lb.  However,  each  curve  in  Fig.  lb  behaves  differently, 
i.e.,  some  has  its  bursty  activities  only  in  the  first  half,  some  other  has  them  only  in 
the  last  half,  and  yet  some  other  has  two  peaks  during  the  hot  span.  This  means  that 
it  is  quite  difficult  to  accurately  detect  the  true  hot  span  from  only  a  single  diffusion 
sample.  Methods  that  use  only  the  observed  bursty  activities,  including  those  proposed 
by  Swan  and  Allan  [2000]  and  Kleinberg  [2002]  would  not  work. 
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(a)  Diffusion  samples  without  a  hot  span  (b)  Diffusion  samples  with  a  hot  span 


Fig.  1:  Information  diffusion  in  the  blog  network  for  the  AsIC  model.  Results  of  live 
independent  runs  are  shown. 


(a)  An  example  of  opinion  population  curves  (b)  An  example  of  opinion  population  curves 
without  a  hot  span  (sample  #1)  without  a  hot  span  (sample  #2) 


(c)  An  example  of  opinion  population  curves  (d)  An  example  of  opinion  population  curves 
with  a  hot  span  (sample  #1)  with  a  hot  span  (sample  #2) 

Fig.  2:  Information  diffusion  in  the  blog  network  for  the  VwV  model. 


3.2.  VwV  Model 

Similarly  to  the  detection  problem  for  the  AsIC  model,  let  [Xi,  T2)  denote  the  hot  span 
of  the  diffusion  of  opinions  under  the  VwV  model.  Recall  that  this  implies  that  the 
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intervals  [0,  Ti)  and  [T2,T)  are  the  normal  spans.  We  place  the  same  assumption  that 
there  is  no  change  in  the  value  of  the  time-delay  parameter  vector  r  for  simplicity.  Let 
wn  and  Wf,  denote  the  opinion- value  vectors  for  the  normal  span  and  the  hot  span, 
respectively.  Note  that  wn/\\wn\\  /  wh/\\wh\\  since  the  opinion  dynamics  under  the 
VwV  model  is  invariant  to  positive  scaling  of  the  opinion- value  vector  w,  where  ||m„|| 
and  |?c/,||  stand  for  the  norm  of  vectors  wn  and  wh,  respectively.  Then,  the  change 
detection  problem  is  formulated  as  follows:  Given  the  opinion  diffusion  data  V(0 ,T)  in 
time-interval  [0,  T),  detect  the  hot  span  [Ti,  T2),  and  estimate  the  opinion-value  vector 
wh  of  the  hot  span  and  the  opinion-value  vector  wn  of  the  normal  span.  Here,  2?(0,  T) 
consists  of  a  sequence  of  ( v ,  t ,  k)  such  that  node  v  changed  its  opinion  to  opinion  k  at 
time  t. 

Figure  2  shows  two  examples  of  opinion  diffusion  sample  with  (Figs.  2c  and  2d)  and 
without  (Figs.  2a  and  2b)  a  hot  span  based  on  the  VwV  model  with  K  =  3  opinions, 
where  the  opinion-value  vectors  are  set  at  w  =  (2.0, 1.0, 1.0)  for  Figs.  2a  and  2b,  and 
wn  =  (2.0, 1.0, 1.0),  wh  =  (3.0, 1.0, 1.0),  Ti  =  10  and  T2  =  20  for  Figs.  2c  and  2d.  The 
network  used  is  the  same  blog  network  as  in  Fig.  1.  We  plotted  the  population  of  each 
opinion  k,  |{u  £  V;  ft(v)  =  k}\,  as  a  function  of  time  t.  It  must  be  difficult  to  know  the 
existence  of  a  hot  span  from  only  their  curves  depicted  in  Figs.  2b  and  2d.  Moreover, 
since  the  VwV  model  is  a  stochastic  process  model,  every  sample  of  opinion  diffusion 
can  behave  differently.  Again,  this  means  that  it  is  quite  difficult  to  accurately  detect 
the  true  hot  span  from  only  a  single  sample  of  opinion  diffusion.  We  believe  that  an 
explicit  use  of  underlying  opinion  diffusion  model  is  essential  to  solve  this  problem.  It 
is  crucially  important  to  detect  the  hot  span  precisely  in  order  to  identify  the  external 
factors  which  caused  the  behavioral  changes. 

4.  MODEL  PARAMETER  LEARNING 

We  describe  the  framework  of  model  parameter  learning  as  a  likelihood  maximization 
problem  for  the  AsIC  and  the  VwV  models. 


4.1.  Parameter  Learning  for  AsIC  Model 

First,  we  consider  estimating  the  values  of  diffusion  probability  p  and  time-delay  pa¬ 
rameter  r  from  an  observed  diffusion  result  2?(0,  T)  =  {•  •  • ,  (v,tv),  ■  ■  •}  when  there  is 
no  hot  span.  Recall  that  the  initial  activation  time  is  set  to  0  and  the  final  observation 
time  is  denoted  by  T.  Let  V  be  the  set  of  all  the  activated  nodes  in  V(0,  T),  i.e., 

V  =  {v£V;  (v,tv)  €2?(0,T)}. 

For  each  node  v  £  V,  let  Av  be  the  set  of  its  parent  nodes  that  had  a  chance  to  activate 
it,  i.e., 


Av  =  {u  £  B(v);  ( u,tu )  £  T>(0, T),tu  <  tv}. 

Although  we  place  a  constraint  that  pUfV  =  p,  r,M,  =  r  (V(u,  v)  £  E),  we  develop  a  gen¬ 
eral  theory  in  terms  of  p  and  r  to  be  consistent  with  the  description  in  Subsection  5.2. 
Let  XUtV(pu  v,  ru<v)  denote  the  probability  density  that  a  node  u  £  Av  activates  the  node 
v  at  time  tv,  that  is, 


Xu,v{j?u,v>Tu,v)  Pu,v  Tu,v  *-Xp(  ^'u.v (iy  Li.)) ■ 


(1) 


Let  yu,v(Pu,v,  ru,v)  denote  the  probability  that  the  node  v  is  not  activated  by  a  node  u  £ 
Av  within  the  time-period  (tu,tv),  that  is, 


{Pu,v  ^  ^U.V )  —  1  Pu,v  j  1"u,v  exp(  T’u  }v(t  til))df 

-  Pu,V  exp  (  TU,V(tV  ty  ) )  T  (1  Pu,v)- 


(2) 
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By  using  Eqs.  (1)  and  (2),  we  can  obtain  the  probability  density  hv(p,  r)  that  a  node  v 
is  activated  at  time  tv, 

^  ^  ‘^u,v{jpu,v>1'u,v)  I  ||  y z,v(jPz,vi  Tz,v)  1  ?  (3) 

u£Av  y2ev4v\{n}  J 

and  the  probability  4>v  z{pv  z,rv  z)  that  a  node  .?  is  not  activated  by  a  node  v  within 

[0,T), 

ipv,z{pv,z,rv,z)  =  Pv,zexp(-rv,z(T  -  tv))  +  (1  -pv,z).  (4) 

Then,  from  Eqs.  (3)  and  (4),  the  following  log  likelihood  function  C(p,  r;  X>(0,  T))  can  be 
obtained  for  observed  data  V(0,  T) 


£(p,r;2>(0,T)) 


E 

vev 


^log  hv(p,r) 


E  log  ipv,z{pv,z 

z£F(v)\T) 


(5) 


Here,  we  recall  pU}V  =  p,  ru>v  =  r  for  any  ( u ,  v)  G  E.  The  values  of  parameters  p 
and  r  can  be  stably  obtained  by  maximizing  Eq.  (5)  using  an  EM-like  algorithm  (see 
Appendix  A  for  more  details). 

Now,  we  assume  that  there  exists  a  hot  span  S  =  [Ti,T2).  Let  pit)  denote  the  value 
of  parameter  p  at  time  t.  According  to  our  problem  setting,  we  consider  the  parameter 
switching, 

n(t\-fPn  if  t  €  [0,  T)  \  S, 
m  -  {  Ph  if  te  S. 

For  the  hot  span  S,  we  split  the  set  of  the  active  nodes  V  as  follows: 

Vn(S)  =  {v€V;  tve[0,T)\S}, 

Vh{S)  =  {v  G  V-  tv  G  S}. 

For  any  v  G  V,  let  hv(pn,  ph,  r;  S)  be  the  probability  density  that  node  v  is  activated  at 
time  tv  when  there  exists  hot  span  S.  By  using  Eqs.  (1)  and  (2),  we  obtain 


K{Pn,Ph,r;S) 


—  ^  ^  (pn  i  r ) 

ueAvnvn(S) 


]^|  .Tz,?;(Pn>r)  1 1  yz,v{Phi  r) 

vzeXnB„(S)\{«}  zGAvnvh(S) 


^  ^  %u,viPh-)T)  |  |  yz,v(Pn,r)  |  |  yz,viPhi  l 


(6) 


uGAvnvh(S) 


izeA„nn„(5) 


z€A.n%(S)\{u} 


Using  Eqs.  (4)  and  (6),  we  can  define  an  objective  function  C{pn,ph,r-,V(0,T),  S)  for 
the  hot  span  detection  problem  by  adequately  modifying  Eq.  (5)  under  the  switching 
scheme  as  follows: 


£{Pn,Ph,r-,T>(0,T),S) 

=  yioghv{pn,Ph,r-,s)+  y  y  log  Tpv,z  {Pm  r ) 

v£V  veT>n(S)  z£F(v)\V 

+  E  E  log  tpv,z(Ph,r).  (7) 

v&VhiS)  z&F(v)\V 
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Clearly,  C(pn,ph,  r;  27(0,  T ),  S)  is  expected  to  be  maximized  by  setting  S  to  the  true  hot 
span  S*  =  [Tf,  T|)  if  a  substantial  amount  of  data  27(0,  7’)  is  available.  Thus,  our  hot 
span  detection  problem  is  formalized  as  the  following  maximization  problem: 

S  =  axgmax C(pn(S),ph{S),r(S)-,  27(0,7"),  S),  (8) 

where  pn{S),  PiJS),  and  r(S)  denote  the  maximum  likelihood  estimators  for  a  given  S. 


4.2.  Parameter  Learning  for  VwV  Model 

We  also  consider  estimating  the  value  of  opinion-value  vector  w  from  an  observed 
opinion  diffusion  data  27(0,  T)  in  time  interval  [0,T)  (a  single  example)  when  there  is 
no  hot  span4.  From  the  evolution  process  of  the  model,  we  can  obtain  the  following  log 
likelihood  function 


£(m;2?(0,T))=log  [] 

( v,t,k)GC{0,T ) 


nk(t,v)wk 
EfcLl  nk'(t,  v)wk'  ’ 


(9) 


where 

C(0,T)  =  £  27(0,  T);  \{u  £  r(v);  ft(u)  ^  0}|  >  2}.5 

Thus,  our  estimation  problem  is  formulated  as  a  maximization  problem  of  the  log  like¬ 
lihood  function  C(w;  27(0,  T))  with  respect  to  w.  We  find  the  optimal  value  of  w  by 
employing  a  standard  Newton  method  (see  Appendix  B  for  more  details). 

Now,  we  assume  that  there  exists  a  hot  span  S  =  [Ti,T2).  Let  w(t)  denote  the  value 
of  opinion-value  vector  w  at  time  t.  We  also  consider  the  following  parameter  vector 
switching: 


w(t) 


wn  if  t  G  [0,  T)  \  S, 

wh  if  t  £  S. 


For  VTS,  Te  with  0  <  7’s  <  7(  <  7',  we  denote  by  27 (Ts,  Te)  the  opinion  diffusion  data  in 
time  interval  [Ts,Te);  i.e., 

V(Ts,Te)  =  {(v,t,k)  £27(0,7);  t  £  [Ts,  Te)}.  (10) 

Then,  similarly  to  the  case  of  the  AsIC  model,  an  objective  function 
C{wn,  wh;  27(0,  T),  S)  can  be  defined  for  the  hot  span  detection  problem  by  ade¬ 
quately  modifying  Eq.  (9)  under  this  switching  scheme  as  follows: 

C(wn,wh;V(0 ,T),S)  =  C{wn\ 27(0,  Tj)  U  27(T2,  T))  +  £(wh;  27(T1;  T2)).  (11) 

Again,  the  extended  objective  function  is  expected  to  be  maximized  by  setting  S  to 
be  the  true  span  S*  =  [T*,T|),  provided  that  27(0,  T)  is  generated  by  the  VwV  model 
with  hot  span  S*  and  is  sufficiently  large.  Therefore,  our  hot  span  detection  problem  is 
formalized  as  the  following  maximization  problem: 

S  =  arg  maxjC(wn(S),  wh(S);  27(0,  T),  S),  (12) 

where  wn(S )  and  WhiS)  denote  the  maximum  likelihood  estimators  for  a  given  S. 


4The  time-delay  parameter  vector  r  can  simply  be  estimated  by  averaging  the  time  intervals  for  each  node, 
and  thus  excluded  from  the  estimation  problem. 

’We  use  only  those  observed  data  in  which  there  is  at  least  one  neighbor  that  has  an  opinion. 
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5.  CHANGE  DETECTION  METHODS 

We  propose  a  general  method  of  detecting  a  hot  span  that  is  applicable  to  both  the 
AsIC  model  and  the  VwV  model.  In  order  to  obtain  the  optimal  hot  span  S  according  to 
either  Eq.  (8)  or  Eq.  (12),  we  need  to  prepare  a  reasonable  set  of  candidate  hot  spans, 
denoted  by  'H.  One  way  of  doing  so  is  to  construct  'H  by  considering  all  pairs  of  observed 
activation  (or  opinion  change)  time  points.  In  general,  let  T  denote  the  set  of  all  the 
observed  activation  (or  opinion  change)  time  points, 


T  =  {to,  ti,  •  •  ■ ,  i at},  (0  =  to  <  ti  <  ■  ■  ■  <  tjv  <  T). 

Then,  we  can  construct  a  set  of  candidate  hot  spans  by 

n  =  {s  =  [Ti,r2);  Ti  <  t2,  t-[  £  r,  t2  e  r}. 

Hereafter,  we  denote  the  model  parameter  vector  by  9 ;  i.e.,  9  =  ( p,r )  for  the  AsIC 
model  and  9  =  w  for  the  VwV  model.  Since  the  parameter  vector  9  is  a  function  of  time 
t  in  our  problem  setting,  we  denote  by  0(f)  the  value  of  9  at  time  t.  Given  a  hot  span 
S  =  [Ti,  T2),  we  consider  the  following  parameter  vector  switching: 


j  9n  if  t  e  [0,  T)  \  S, 
\  9h  if  t€S. 


Let  S*  =  [T* ,  T2* )  be  the  true  hot  span.  We  assume  that  observed  data  T>(0,T)  is  gen¬ 
erated  by  using  the  parameter  vector  9*  (t)  of  hot  span  S*.  In  what  follows,  after  intro¬ 
ducing  a  naive  method,  we  describe  our  proposed  detection  method. 


5.1.  Naive  Method 

Both  Eq.  (8)  and  Eq.  (12)  can  be  solved  by  a  naive  method  which  has  two  iterative 
loops.  In  the  inner  loop  we  first  obtain  the  maximum  likelihood  estimators,  9„  and  d/,  , 
for  each  candidate  S  by  maximizing  the  objective  function  C(9„  .  9tl :  V((),  T),  S)  using 
either  the  EM-like  algorithm  or  the  Newton  method.  In  the  outer  loop  we  select  the 
optimal  S  which  gives  the  largest  C(9„.  9h\ 2X0,  T),S)  value.  However,  this  method  can 
be  extremely  inefficient  when  the  number  of  candidate  spans  is  large.  Thus,  in  order  to 
make  it  work  with  a  reasonable  computational  cost,  we  consider  restricting  the  number 
of  candidate  time  points  to  a  smaller  value,  denoted  by  J,  i.e.,  we  construct  7}  (C  T) 
by  selecting  J  points  from  T ;  then  we  construct  a  restricted  set  of  candidate  spans  by 

1Hj  =  {S  =  [7i,T2);  Ti  <  T2,TX  e  Tj,T2  e  Tj}. 

Note  that  \Hj\  =  J(J  —  l)/2,  which  is  large  when  J  is  large. 

5.2.  Proposed  Method 

It  is  easily  conceivable  that  the  naive  method  can  detect  the  hot  span  with  a  reasonably 
good  accuracy  when  we  set  J  large  at  the  expense  of  the  computational  cost,  but  the 
accuracy  becomes  poorer  when  we  set  J  smaller  to  reduce  the  computational  load. 
We  propose  a  novel  detection  method  below  which  alleviates  this  problem  and  can 
efficiently  and  stably  detect  a  hot  span  from  the  diffusion  result  2?(0,  T). 

We  first  obtain  the  maximum  likelihood  estimators,  9,  based  on  the  original  objective 
function  of  either  Eq.  (5)  or  Eq.  (9).  Next,  we  focus  on  the  first-order  derivative  of  the 
objective  function  C(9;  X>(0,  T))  with  respect  to  the  parameter  vector  9  in  each  observa¬ 
tion  interval  [tj-\,tj).  More  specifically,  we  define  a  function  C(9I ,  ■  ■  ■  ,9^:'D((i.  7’))  of 
9 1,  •••,  9n  by 

C{6u-  -,9n-,V(0,T))  =  £(0(i);Z>(O,T)), 
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Fig.  3:  Direction  of  the  gradient  vector  at  0  in  the  normal  and  the  hot  span. 


where  0(f)  =  0 j  iff  e  (j  =  1,---,N).  Since  £(0;£>(O  ,T))  =  £(0,  ■  ■  ■ ,  0:  V(0,  T)) 

and  0  is  the  maximum  likelihood  estimator  based  on  £(0:  V((),  T)),  i.e. ,  when  no  change 
in  0  is  assumed,  we  have 


N 


Q  8C{0 ;  X>(0,  T))  _  ^  0£(0,  ■  -  •  ,0;  X>(0,  T)) 


80 


3  =  1 


50, 


(13) 


Note  that  £(0 1,  ■  •  • ,  0;y:  2?(0,  T))  can  be  expected  to  attain  the  maximum  when  each  0, 
is  given  as  follows:  0j  =  0h  if  [tj-i,tj)  is  included  in  the  hot  span  and  0:i  =  0n  if  [tj-i .  f,j 
is  included  in  the  normal  span,  i.e.,  £(0i,  •  •  • ,  0N;T>(O,  T))  =  C(0n,0h;T>(O,T),S*). 
Thus,  we  introduce  modification  vectors,  i?i,  ■  •  ■ ,  i9jv,  defined  by  0t  =  0/,— 0  if  is 

included  in  the  hot  span  and  d,  =  0n  —  0  if  [tj-i,  tj )  is  included  in  the  normal  span.  Let 
AC  be  C(0n.  0h ;  Z>(0,  T),S*)  -  £(0 ,  •  •  • ,  0;  Z>(0,  T))  =  £(0  + tb,  •  •  • ,  0  +  w!  X>(0,  T),  S*)  - 
£(0,  •  •  • ,  0;  2?(0,  T)).  Then,  we  can  obtain  the  following  first-order  Taylor  expansion: 


A  C. 


N 

E 

3= 1 


d£(0,---,0;D(O,T)) 


50, 


0, 


E 

r, 


d£(0,---,0;P(O,r)) 

(90, 


(0.-0)+  E 

r, 


8C(0,  •  •  • ,  0;  D(0,  T)) 


90, 


Moreover,  by  noting  Eq.  (13),  we  obtain  the  following  result: 

8C(0,  •  •  • ,  0;  2?(0,  T)) 


A£ 


E 

i; 


00, 


(0.  —  0n)  ■ 


(0n-0). 


(14) 


Here  note  that  we  can  naturally  assume  that  each  gradient  vector  with  respect  to  0j  is 
likely  to  be  parallel  to  (0h  —  0n),  as  shown  by  arrows  in  Fig.  3.  Therefore,  from  Eq.  (14), 
by  considering  the  following  partial  sum  for  a  candidate  hot  span  S  =  [T),  T2)  G  % : 


9<S)  =  £ 

r,  [tj-i,tj)cs 


d£(0,---,0;P(O,T)) 

80, 


we  can  expect  that  119(5)11  is  maximized  when  S  «  5*. 

Therefore,  we  propose  the  method  of  detecting  the  hot  span  by 


S  =  argmax||g(5)||. 

Ocri 


(15) 


(16) 


In  case  of  the  AsIC  model, 


Il0(5)||2  = 


E 

( u,v)EE ;  u£T>h(S) 


8C(p,f-,V(0,T)) 


8pu,t 
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(see  Eq.  (5)),  and  in  case  of  the  VwV  model, 


9(S ) 


ac(w;x>(ri,r2)) 

dw 


(see  Eqs.  (9)). 

Here  note  that  we  can  incrementally  calculate  g{S).  More  specifically,  we  can  obtain 
the  following  formula: 


g([U,tj- 1))  + 


<>C,0.  ■  ■  ■.():  /hi),  7  jj 

do, 


(17) 


for  any  ti,tj-i,tj  G  T  with  t,  <  tj- \  <  tr  The  computational  cost  of  the  proposed 
method  for  examining  each  candidate  span  is  much  smaller  than  the  naive  method 
described  above.  When  |T|  is  very  large,  we  construct  a  restricted  set  of  candidate 
spans  T-Lj  as  explained  above.  We  summarize  our  proposed  method  below. 


1.  Maximize  C(0;  T>{ 0,  T))  by  using  the  parameter  estimation  method. 

2.  Construct  the  candidate  time  set  T  and  the  candidate  hot  span  set  'H. 

3.  Detect  the  hot  span  S  by  Eq.  (16)  and  output  S. 

4.  Maximize  C(On,6h;V(0,T),  S)  by  using  the  parameter  estimation  method,  and 
output  (0n.  0k). 

Here  note  that  the  proposed  method  requires  likelihood  maximization  by  using  the 
parameter  estimation  method  only  twice. 


6.  EXPERIMENTAL  EVALUATION 

We  experimentally  investigated  how  accurately  the  proposed  method  can  estimate 
both  the  hot  span  and  the  diffusion  parameters  for  the  hot  and  the  normal  spans, 
as  well  as  its  efficiency,  by  comparing  it  with  the  naive  method  using  four  real  world 
networks. 


6.1.  Datasets 

We  used  four  real  large  networks  which  are  all  bidirectionally  connected6.  The  first 
one  is  a  trackback  network  of  Japanese  blogs  used  in  [Kimura  et  al.  2009].  It  has 
12, 047  nodes  and  79, 920  directed  links  (the  blog  network).  The  second  one  is  a  coau¬ 
thorship  network  used  in  [Palla  et  al.  2005],  which  has  12, 357  nodes  and  38, 896  di¬ 
rected  links  (the  Coauthorship  network).  The  third  one  is  a  network  derived  from  the 
Enron  Email  Dataset  [Klimt  and  Yang  2004]  by  extracting  the  senders  and  the  recipi¬ 
ents  and  linking  those  that  had  bidirectional  communications.  It  has  4, 254  nodes  and 
44, 314  directed  links  (the  Enron  network).  The  fourth  one  is  a  network  of  people  that 
was  derived  from  the  “list  of  people”  within  Japanese  Wikipedia,  used  in  [Kimura  et  al. 
2008],  and  has  9,481  nodes  and  245, 044  directed  links  (the  Wikipedia  network). 

6.2.  Experimental  Settings 

We  generated  diffusion  results  using  both  the  AsIC  model  (for  information  diffusion 
evaluation)  and  the  VwV  model  (for  opinion  population  diffusion  evaluation)  for  each 
of  the  above  networks  under  the  following  settings.  As  for  the  AsIC  model,  we  con¬ 
sidered  p  —  1/d  as  the  base  value  of  the  diffusion  probability  of  each  link,  where  d  is 
the  mean  out-degree  of  the  network.  With  this  base  value,  for  an  arbitrary  node  in  the 


6We  wanted  to  use  the  real  data  measured  in  the  real  network  where  there  is  a  known  external  change,  but 
unfortunately  we  were  not  able  to  find  such  data.  We  are  still  looking  for  a  good  dataset  that  can  be  used  to 
validate  our  approach. 
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network,  the  expected  number  of  its  child  nodes  that  it  succeeds  to  activate  becomes 
approximately  equal  to  one  at  least  at  an  early  phase  of  the  information  diffusion.  If 
the  diffusion  probability  is  much  smaller  than  the  base  value,  the  diffusion  process 
would  end  up  with  only  a  small  number  of  active  nodes  on  the  average.  On  the  other 
hand,  if  it  is  much  larger,  the  information  rapidly  spreads  out  the  entire  network  and 
the  process  finishes  at  an  early  phase  of  the  diffusion.  Both  cases  are  not  appropriate 
to  our  aim  of  investigating  the  hot  span  detection,  i.e.,  we  need  a  fair  amount  of  infor¬ 
mation  diffusion  taking  place  around  the  hot  span.  Thus,  in  our  experiments,  we  set 
the  diffusion  probability  for  the  normal  span,  p* ,  to  be  a  value  slightly  smaller  than 
the  base  value,  and  set  the  diffusion  probability  for  the  hot  span,  p*h,  to  be  three  times 
as  large  as  the  p*.  As  a  result,  p*n  and  p*h  are  0.1  and  0.3  for  the  blog  network,  0.2  and  0.6 
for  the  Coauthorship  network,  0.05  and  0.15  for  the  Enron  network  and  0.02  and  0.06 
for  the  Wikipedia  network,  respectively.  As  explained  in  3.1,  we  assumed  that  the  time 
delay  parameter  does  not  change,  and  fixed  its  value  to  be  1  (r*  =  1)  for  all  the  net¬ 
works  because  changing  r*  works  only  for  scaling  the  time  axis  of  the  diffusion  results. 
We  set  the  observation  period  to  be  [0,  T  =  30)  and  the  hot  span  to  be  [T*  =  10,  T2*  =  20) 
based  on  the  observation  on  the  preliminary  experiments.  In  all  we  generated  10  infor¬ 
mation  diffusion  results  using  these  parameter  values,  each  starting  from  a  randomly 
selected  initial  active  node  for  each  network. 

As  for  the  VwV  model,  for  each  of  the  above  networks,  we  generated  opinion  diffusion 
results  according  to  the  model  for  three  different  values  of  K  (the  number  of  opinions), 
i.e.,  K  =  2,  4,  and  8,  by  choosing  the  top  K  nodes  with  respect  to  node  degree  ranking 
as  the  initial  K  nodes.  We  assumed  that  the  value  of  all  the  opinions  were  initially  1.0, 
i.e.,  the  value-parameters  for  all  the  opinions  are  1.0  for  the  normal  span,  and  further 
assumed  that  only  the  value  of  the  first  opinion  changed  to  double  for  the  hot  span,  i.e., 
the  value-parameter  of  the  first  opinion  is  2.0  and  the  value-parameters  of  all  the  other 
opinions  are  1.0  for  the  hot  span.  Again,  based  on  the  observation  on  the  preliminary 
experiments,  we  set  the  observation  period  and  the  hot  span  to  be  [0,  T  =  25)  and 
[T*  =  1 0. 72*  =  15),  respectively,  and  generated  10  opinion  diffusion  results  for  each 
network. 

We  then  estimated  the  hot  span  [T* ,  T2 )  and  the  diffusion  parameters  of  each  model, 
i.e.,  the  diffusion  probabilities  p*n  (for  the  normal  span)  and  p*h  (for  the  hot  span)  for 
the  AsIC  model,  and  the  opinion-value  vectors  to*  (for  the  normal  span)  and  w*h  (for 
the  hot  span)  for  the  VwV  model  by  the  two  methods  (the  proposed  and  the  naive),  and 
compared  them  in  terms  of  1)  the  accuracy  of  the  estimated  hot  span  S  =  [T\,T2),  2) 
the  accuracy  of  the  estimated  diffusion  parameters,  pn,  ph,  wn,  and  wh,  3)  their  inte¬ 
grated  estimation  error,  and  4)  the  computation  time.  The  accuracy  of  the  estimated 
hot  span  is  measured  in  the  absolute  error  £$  =  \T\  —  T* \  +  \T2  —  T2*  |  for  both  the 
AsIC  and  VwV  models.  The  accuracy  of  the  estimated  diffusion  parameters  is  eval¬ 
uated  in  the  mean  relative  error,  i.e.,  £p  =  | pn  —  pn\/pn  +  \Ph  —  Ph\/Ph  for  the  AsIC 
model,  and  £w  =  wn.  —  w*.\/w* .  +  \wht  —  w^/w^J/K  for  the  VwV  model,  where 
w*.  and  w*h.  are  values  of  opinion  i  for  the  normal  and  the  hot  spans,  respectively, 
and  U>n,  and  w*h  are  their  estimated  values.  Integrating  their  estimation  errors  by 
£()if)  =  /0T  \  \9(t)—9*(t)\\Lidt  allows  us  to  evaluate  the  estimation  ability  of  each  method 

in  a  comprehensive  manner,  where  0*(t)  and  9(t)  is  the  diffusion  parameter  vector  to 
be  assumed  true  and  its  estimation  at  time  t  for  the  corresponding  model,  respectively. 
For  the  proposed  method,  we  adopted  1, 000  as  the  value  of  J  (the  number  of  candidate 
time  points)  for  the  VwV  model,  while  we  used  all  the  possible  time  points,  i.e.,  J  =  N 
for  the  AsIC  model  because  the  number  of  time  points  for  opinion  changes  in  the  VwV 
model  is  observed  to  be  much  larger  than  the  number  of  node  activation  for  the  AsIC 
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(a)  Blog  (b)  Coauthorship  (c)  Enron  (d)  Wikipedia 

Fig.  4:  Comparison  of  the  accuracy  in  the  estimated  hot  span  for  the  AsIC  model. 
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(a)  Blog  (b)  Coauthorship  (c)  Enron  (d)  Wikipedia 

Fig.  5:  Comparison  of  the  accuracy  in  the  estimated  diffusion  probability  for  the  AsIC 
model. 
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(a)  Blog  (b)  Coauthorship  (c)  Enron  (d)  Wikipedia 

Fig.  6:  Comparison  of  the  integrated  estimation  error  for  the  AsIC  model. 


(a)  Blog  (b)  Coauthorship  (c)  Enron  (d)  Wikipedia 

Fig.  7:  Comparison  of  the  computation  time  for  the  AsIC  model. 


model  for  the  same  period  of  time.  For  the  naive  method,  we  tested  three  cases  of  J  =  5, 
10,  and  20  for  both  the  models.  Both  the  proposed  and  the  naive  methods  were  tested 
on  each  diffusion  result  for  each  model  mentioned  above  on  a  PC  with  Intel  Core  2  Duo 
3GHz,  and  the  results  were  averaged  over  the  ten  independent  trials  for  each  network. 
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6.3.  Results  for  AsIC  Model 

Figures  4  to  7  summarize  the  results  for  the  AsIC  model.  Figure  4  shows  the  accu¬ 
racy  of  S  in  the  absolute  error  £s  defined  above.  We  see  that  the  proposed  method 
achieves  a  good  accuracy,  much  better  than  the  naive  method  for  every  network.  As 
expected,  £s  for  the  naive  method  decreases  as  J  becomes  larger.  But,  even  in  the  best 
case  ( J  =  20),  its  average  error  is  about  3  to  10  times  larger  than  that  of  the  proposed 
method.  Figure  5  shows  the  accuracy  of  pn  and  ph  in  the  relative  error  £p.  Here  again, 
the  average  relative  error  for  the  naive  method  decreases  as  J  becomes  larger.  How¬ 
ever,  even  in  the  best  case  ( J  =  20),  it  is  about  2  to  3  times  larger  than  that  of  the 
proposed  method.  We  note  that  the  average  errors  for  the  Coauthorship  network  are 
relatively  large.  This  is  because  the  number  of  active  nodes  within  the  normal  span 
was  relatively  small  for  this  network.  Figure  6  shows  the  integrated  estimation  error 
given  by  £g(ty  which  supplements  our  insights  derived  from  the  above  results.  For  ex¬ 
ample,  although  the  relative  error  of  the  estimated  diffusion  probabilities  of  the  naive 
method  ( J  =  20)  is  less  than  twice  as  big  as  the  proposed  method  for  the  Enron  net¬ 
work,  its  value  of  £g^  becomes  more  than  twice  of  the  proposed  method  by  considering 
the  estimation  error  of  the  hot  span.  Overall,  even  in  the  best  case  of  the  naive  method 
(J  =  20),  its  integrated  estimation  error  is  about  2  to  4  times  larger  than  that  of  the 
proposed  method.  Figure  7  shows  the  computation  time.  It  is  clear  that  the  proposed 
method  is  much  faster  than  the  naive  method.  The  significant  difference  is  attributed 
to  the  difference  in  the  number  of  runs  of  the  EM-like  algorithm.  The  proposed  method 
executes  the  EM-like  algorithm  only  twice:  steps  1  and  4  in  the  algorithm  (see  Section 
5.2).  On  the  other  hand,  the  naive  method  has  to  execute  the  EM-like  algorithm  once 
for  every  single  candidate  hot  span  S  G  T-Lj  which  is  \Hj\  =  J{J  —  l)/2  times  (see  Sec¬ 
tion  5.1).  Indeed,  the  computation  time  of  the  naive  method  for  J  =  5  is  about  5  times 
larger  for  every  network,  which  is  consistent  with  the  fact  that  \H^\  =  10.  This  relation 
roughly  holds  also  for  the  other  two  cases  ( J  =  10  and  J  =  20).  This  means  that  even 
if  the  naive  method  could  achieve  a  good  accuracy  by  setting  J  to  a  sufficiently  large 
value,  it  would  require  unacceptable  computation  time  for  such  a  large  J.  Overall,  the 
proposed  algorithm  is  about  3  times  more  accurate  in  the  fastest  case  for  the  naive 
method  (in  the  case  of  the  Coauthorship  network  under  J  =  5)  and  about  100  times 
faster  in  its  most  accurate  case  (in  the  case  of  the  Wikipedia  network  under  J  =  20).  Fi¬ 
nally,  we  illustrate  the  actual  behavior  of  ||</(S)||  derived  from  an  information  diffusion 
result  for  the  blog  network  under  the  AsIC  model  in  Fig.  8a,  where  ||g(5)||  is  depicted 
as  a  function  of  the  ending  point  tj  of  S  when  its  starting  point  is  fixed  to  a  certain 
value.  We  can  see  the  blue  broken  curve  showing  ||g([0,  tj))\\  has  two  peaks  at  around 
tj  =  10  and  tj  =  20,  which  are  the  starting  and  ending  points  of  the  true  hot  span, 
respectively.  This  means  that  the  sign  of  dC(p,  r;V(0,T))/dpUtV  reversed  at  these  time 
points  as  explained  in  Section  5.2  7 .  Thus,  the  red  solid  curve  showing  ||g([10,  tj))\\  has 
only  one  peak  at  around  tj  =  20,  which  is  the  global  maximum  among  all  the  possible 
||g(S)||.  Thanks  to  Eq.(17),  the  proposed  method  can  efficiently  calculate  the  behavior 
of  ||<7(5)  ||,  and  thus  can  find  out  the  hot  span  more  accurately  and  more  efficiently  than 
the  naive  method  does. 

In  summary,  we  can  say  that  the  proposed  method  can  detect  and  estimate  the  hot 
span  and  diffusion  probabilities  for  the  AsIC  model  much  more  accurately  and  effi¬ 
ciently  compared  with  the  naive  method. 
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(a)  AsIC  Model  (b)  VwV  Model 

Fig.  8:  Change  of  1 1 <7 f .S' )  1 1  when  given  a  fixed  starting  point  of  a  time  span  S  for  a 
diffusion  result  retrieved  from  the  blog  network  under  the  respective  experimental 
setting  for  each  information  diffusion  model. 


6.4.  Results  for  Voter  Model 

Figures  9  to  12  show  the  experimental  results  for  the  VwV  model.  Similarly  to  the 
results  for  the  AsIC  model,  from  these  results,  we  can  find  that  the  proposed  method  is 
much  more  accurate  than  the  naive  method  for  every  network.  Again,  the  average  error 
for  the  naive  method  decreases  as  J  becomes  larger.  But,  even  in  the  best  case  for  the 
naive  method  ( J  =  20),  its  average  error  in  the  estimation  of  the  hot  span  is  maximum 
about  30  times  larger  than  that  of  the  proposed  method  (in  the  case  of  the  Enron 
network  under  K  =  2)  as  shown  in  Fig.  9,  and  it  is  maximum  about  6  times  larger 
in  the  estimation  of  opinion-values  (in  the  case  of  the  Coauthorship  network  under 
K  =  2)  as  shown  in  Fig.  10.  Figure  11  shows  that  the  proposed  method  is  better  than 
the  naive  method  in  the  integrated  estimation  accuracy  for  every  case.  It  is  noted  that 
the  naive  method  needs  much  longer  computation  time  to  achieve  these  best  accuracies 
than  the  proposed  method  as  shown  in  Fig.  12  despite  that  the  number  of  candidate 
time  points  for  the  naive  method  is  50  times  smaller.  Indeed,  it  is  about  20  times  longer 
in  the  case  of  the  Enron  network  under  K  =  2,  about  13  times  longer  in  the  case  of 
the  Coauthorship  network  under  K  =  2,  and  maximum  about  95  times  longer  for  the 
whole  results  (in  the  case  of  the  Enron  network  under  K  =  8).  Overall,  the  proposed 
method  is  about  7  times  more  accurate  in  the  fastest  case  for  the  naive  method  (in 
the  case  of  the  blog  network  under  I\  =  2  and  J  =  5)  and  about  13  times  faster  in 
its  most  accurate  case  (in  the  case  of  the  Coauthorship  network  under  K  =  2  and 
J  =  20).  Figure  8b  shows  the  behavior  of  ||g(S)||  derived  from  an  opinion  diffusion 
result  for  the  blog  network  under  the  VwV  model.  Similarly  to  the  case  of  the  AsIC 
model,  it  is  found  that  the  blue  broken  curve  showing  ||g([0,t)-))||  has  two  peaks  at 
around  tj  =  10  and  tj  =  15,  which  are  the  starting  and  ending  points  of  the  true  hot 
span,  respectively.  In  this  case,  the  red  solid  curve  starting  from  t3  =  10  has  only  one 
peak  at  around  t3  =  15,  which  becomes  the  global  maximum  among  all  the  possible 
119(5)11.  The  proposed  method  can  find  out  the  time  span  that  results  in  the  global 
maximum  from  a  set  of  the  candidate  time  points  efficiently  for  the  VwV  model,  too. 


7Since  in  this  case  the  partial  derivative  is  a  scalar,  it  suffices  to  say  its  sign. 
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Fig.  9:  Comparison  of  the  accuracy  in  the  estimated  hot  span  for  the  VwV  model. 
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Fig.  10:  Comparison  of  the  accuracy  in  the  estimated  opinion-value  vector  for  the  VwV 
model. 
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Fig.  11:  Comparison  of  the  integrated  estimation  error  for  the  VwV  model. 
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Fig.  12:  Comparison  of  the  computation  time  for  the  VwV  model. 


From  these  results,  it  can  be  concluded  that  the  proposed  method  is  able  to  detect 
and  estimate  the  hot  span  and  opinion-values  for  the  VwV  model  much  more  accu¬ 
rately  and  efficiently  compared  with  the  naive  method. 
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7.  DISCUSSION 

The  results  in  the  previous  section  indicates  that  the  proposed  approach  works  as  in¬ 
tended  to  both  AsIC  and  VwV  diffusion  models.  Although  we  believe  that  the  approach 
is  generic,  it  has  yet  to  be  verified  whether  the  approach  is  applicable  to  any  other 
model  in  so  far  as  it  is  formulated  as  a  probabilistic  diffusion  model. 

We  placed  a  simplifying  constraint  that  the  parameters  pu,v  and  ru>v  of  the  AsIC 
model  do  not  depend  on  link  (u,v),  i.e.,  pu>v  =  p,  ru<v  =  r  (V(u,v)  €  E),  by  focusing  on 
single  topic  diffusion  sequences.  Our  previous  experiments  [Saito  et  al.  2009b;  2010a; 
2010b]  give  some  evidences  which  support  the  validity  of  this  constraint.  We  exam¬ 
ined  7, 356  diffusion  sequences  for  a  real  blogroll  network  containing  52, 525  bloggers 
and  115,552  blogroll  links,  and  have  experimentally  confirmed  that  the  diffusion  and 
time-delay  parameters  that  were  learned  from  different  diffusion  sequences  belonging 
to  the  same  topic  were  quite  similar  for  most  of  the  topics.  This  observation  naturally 
suggests  that  people  behave  quite  similarly  for  the  same  topic.  On  the  other  hand, 
our  recent  study  indicates  that  these  parameters  can  be  learned  by  assuming  their 
functional  dependency  on  the  neighboring  node  attributes  [Saito  et  al.  2011b].  We  can 
extend  this  approach  to  augment  the  attributes  to  include  the  node  independent  exter¬ 
nal  factor.  This  way  the  uniformity  assumption  can  be  removed.  We  have  considered 
only  the  AsIC  as  a  model  of  general  information  diffusion,  but  it  is  straightforward  to 
apply  the  same  technique  to  the  AsLT  model  [Saito  et  al.  2010b]  and  the  SIS  (suscep¬ 
tible/infectious/susceptible)  versions  of  both  the  models  in  which  each  node  is  allowed 
to  be  activated  multiple  times. 

The  change  pattern  we  used  is  also  very  simple.  We  assumed  that  the  parameters 
of  all  nodes  and  links  change  instantaneously  and  simultaneously  in  the  same  de¬ 
gree  and  stay  the  same  during  a  given  hot  span.  We  can  assume  a  more  intricate 
problem  setting  such  that  pUjV  (for  AsIC),  wu  (for  VwV)  and  r,M,  (for  both)  change 
for  multiple  distinct  hot  spans  and  the  shape  of  change  pattern  pu<v  and  wu  are  not 
necessarily  rect-linear.  One  possible  extension  is  to  approximate  the  pattern  of  any 
shape  by  J  pairs  of  time  interval  each  with  its  corresponding  pu,v,j  an(i  wu,j,  fe., 
Zj  =  {( Pj ,  j  =  1,  •  •  •  J}  (t0  =  0,  tj  =  oo)  and  use  a  divide-and-conquer  type 

greedy  recursive  partitioning,  still  employing  the  derivative  of  the  likelihood  function 
as  the  main  measure  for  search.  For  brevity  we  drop  the  u,  v  dependency  and  consider 
only  the  AsIC  model.  More  specifically,  we  first  initialize  Zi  =  {(p\ .  ([0,  oo))}  where 
Pi  is  the  maximum  likelihood  estimator,  and  search  for  the  first  change  time  point  t\ , 
which  we  expect  to  be  the  most  distinguished  one,  by  maximizing  ||g(S)||  that  uses  in 
as  6  for  the  whole  span  [0,  oo).8  We  recursively  perform  this  operation  J  times  by  fixing 
the  previously  determined  change  points.  When  to  stop  can  be  determined  by  a  statis¬ 
tical  criterion  such  as  AIC  or  MDL.  This  algorithm  requires  parameter  optimization  J 
times.  Figure  13  is  one  of  the  preliminary  results  obtained  for  two  distinct  rect-linear 
patterns  using  five  sequences  in  case  of  the  blog  network.  MDL  is  used  as  the  stopping 
criterion.  The  change  pattern  of  p  is  almost  perfectly  detected  with  respect  to  both  pj 
and  tj  ( J  =  5).  We  might  further  want  to  introduce  some  stochastic  natures  into  the 
model  for  some  external  factors  that  affect  parameter  changes  reflecting  the  fact  that 
each  individual’s  response  to  the  external  factors  is  different,  i.e.,  some  people  respond 
quickly  and  others  slowly. 

The  change  we  considered  is  only  in  the  time  domain  and  we  assumed  that  there  is 
no  spatially  local  change.  We  can  consider  a  more  general  setting,  i.e.,  spatio-temporal 
change  in  parameter  values.  We  need  a  more  elaborate  algorithm  to  cope  with  this 
extension  but  the  basic  approach  of  using  the  first  derivative  of  the  likelihood  function 


8  Note  that  the  total  sum  of  g  =  0. 
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Fig.  13:  Information  diffusion  in  the  blog  network  with  two  hot  spans  for  the  AsIC 
model. 

remains  valid.  We  assumed  that  the  network  structure  is  stationary  although  we  in¬ 
troduced  the  change  in  the  parameter  value.  The  model  we  used  does  not  account  for 
the  structure  change  by  itself.  However,  once  the  structure  change  is  known,  i.e.,  ad¬ 
dition/deletion  of  nodes  and  links  at  each  time  instance,  it  is  straightforward  to  apply 
the  proposed  algorithm  to  these  changes  because  the  dynamics  of  a  node  is  determined 
by  the  interaction  with  its  neighbors,  i.e.,  local  structure  of  the  network. 

8.  CONCLUSION 

In  this  paper,  we  addressed  the  problem  of  detecting  changes  in  behavior  of  informa¬ 
tion  diffusion  over  a  social  network  which  is  caused  by  changes  in  unknown  external 
factors  from  a  limited  amount  of  observed  diffusion  sequences  in  a  retrospective  set¬ 
ting.  The  information  diffusion  process  is  described  by  a  probabilistic  model  with  some 
parameters  that  characterize  the  behavior,  and  the  change  in  unknown  external  fac¬ 
tors  is  assumed  to  be  effectively  reflected  in  changes  in  the  parameter  values  in  the 
model.  We  called  the  period  where  the  parameter  takes  anomalous  values  as  “hot  span” 
and  the  rest  as  “normal  span”,  and  the  problem  is  reduced  to  detecting  the  hot  span, 
i.e. ,  identifying  the  time  window  where  the  parameter  value  is  anomalous  and  estimat¬ 
ing  the  parameter  values  both  in  the  hot  and  normal  spans.  We  solved  this  problem  by 
searching  the  time  window  that  maximizes  the  likelihood  of  generating  the  observed 
information  diffusion  sequences.  Our  main  contribution  is  that  we  devised  a  very  effi¬ 
cient  general  iterative  search  algorithm  which  is  robust  and  applicable  to  a  wide  class 
of  probabilistic  information  diffusion  models.  The  algorithm  uses  the  first  derivative  of 
the  likelihood  with  respect  to  the  parameters,  uses  it  in  the  window  search  (outer  loop) 
and  avoids  parameter  value  optimization  during  the  search  (inner  loop).  It  only  needs 
to  estimate  the  parameter  value  twice  (at  the  first  and  the  final  steps  of  the  search). 
This  is  in  contrast  to  the  naive  learning  algorithm  which  has  to  iteratively  update 
the  pattern  boundaries  (outer  loop),  each  requiring  the  parameter  value  optimization 
to  maximize  the  likelihood  for  the  candidate  window  (inner  loop),  which  is  very  in¬ 
efficient  and  totally  unacceptable.  We  showed  that  the  algorithm  works  satisfactorily 
well  for  two  instances  of  the  probabilistic  information  diffusion  model  which  has  dif¬ 
ferent  characteristics:  asynchronous  independent  cascade  (AsIC)  model  as  a  model  of 
information  push  style  and  value-weighted  voter  (VwV)  model  as  a  model  of  informa- 
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tion  pull  style.  The  AsIC  is  the  model  for  general  information  diffusion  with  binary 
states  and  the  parameter  to  detect  its  change  is  diffusion  probability  and  the  VwV  is 
the  model  for  opinion  formation  with  multiple  states  and  the  parameter  to  detect  its 
change  is  opinion  value.  The  results  tested  on  these  two  models  using  four  real  world 
network  structures  and  a  single  rect-linear  change  confirmed  that  the  algorithm  is  ro¬ 
bust  enough  and  can  efficiently  identify  the  correct  change  pattern  of  the  parameter 
values.  Comparison  with  the  naive  method  that  finds  the  best  combination  of  change 
boundaries  by  an  exhaustive  search  through  a  set  of  randomly  selected  boundary  can¬ 
didates  showed  that  the  proposed  algorithm  far  outperforms  the  native  method  both 
in  terms  of  accuracy  (about  3  times  more  accurate  for  the  AsIC  model  and  about  7 
times  accurate  for  the  VwV  model  in  the  fastest  case  for  the  naive  method)  and  com¬ 
putation  time  (about  100  times  faster  for  the  AsIC  model  and  about  13  times  faster 
for  the  VwV  model  in  the  most  accurate  case  for  the  naive  method).  The  problem  set¬ 
ting  we  assumed  in  this  paper  is  very  simple,  but  we  expect  that  the  proposed  method 
can  be  easily  extended  to  solve  more  intricate  problems.  We  showed  one  possible  direc¬ 
tion  and  the  preliminary  result  obtained  for  two  rect-linear  shape  hot  spans  was  very 
promising.  Our  immediate  future  work  is  to  evaluate  our  method  using  real  world  in¬ 
formation  diffusion  samples  with  hot  spans,  as  well  as  to  deal  with  spatio-temporal 
hot  span  detection  problems  using  more  appropriate  stochastic  models  under  a  similar 
problem  solving  framework. 

APPENDIX 

A.  ESTIMATION  ALGORITHM  FOR  ASIC  MODEL 

We  briefly  describe  the  estimation  algorithm  of  parameters  p  and  r  for  the  AsIC  model 
from  a  sequence  of  observed  data  X>(0 ,  T)  (see  [Saito  et  al.  2009b;  2010a]  for  more 
details). 

We  employ  an  EM-like  algorithm.  Let  /l  and  f  be  the  current  estimates  of  p  and  r. 
Using  Eqs.  (1)  and  (2),  we  define  aUtV  and  as  follows: 

_  Xu,v(p,r)/yu>v{p,f) 

au’v  EzeA„XZAP,r)/yz,v(p,f) 
a  _  pexp(-r(tv  -  tu)) 

Pu'v  ~  yu,v{p,f) 

The  update  formulas  of  p  and  r  are  as  follows: 

_  (®v,v  +  (1  —  &u,v)Pv,v) 

\{(u,v)  €  E;  u  £V}\ 

_  _ 'Ey gP  ®U<V _ 

Stig'D  ^iuGAv  Au>v  T  (1  £u,v)  Pu.v)  (tv  ty) 

B.  ESTIMATION  ALGORITHM  FOR  VWV  MODEL 

We  briefly  describe  the  estimation  algorithm  of  parameter  vector  w  for  the  VwV  model 
from  an  observed  data  V(0 ,T)  (see  [Kimura  et  al.  2010b]  for  more  details).  As  men¬ 
tioned  in  Subsection  3.2,  the  opinion  dynamics  for  the  VwV  model  is  invariant  to  posi¬ 
tive  scaling  of  w.  Thus,  we  transform  the  parameter  vector  w  by  w  =  w(z),  where 

w(z)  =  (exp(zi),  •  •  • ,  exp(zK-i),  1),  (z  =  (zi,  ■  ■  ■ ,  zK-i)  G  RA_1^)  . 

Namely,  our  problem  is  to  estimate  the  value  of  z  that  maximize  £.(w(z):  'D(0.  T)). 
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Then,  for  any  i,j  e  {1,  •  •  • ,  K  —  1},  we  obtain 


ac(W(z);x>(o,r)) 

dzi 

d2£(w(z);V(0,T)) 
dzi  dzj 


X  (4,*  ~  qi(t,v )), 

(v,t,k)eC(0,T) 

X  (' Qi(t,v)qj{t,v )  -  6itjqi(t,v )), 

(• v,t,k)eC(0,T ) 


where  Sij  is  the  Kronecker’s  delta,  and 

rii(t,v)  exp(zi) 

qi(t,v)  =  - j- - . 

nK(t,v)  +  ne(t,v)  exp(^) 

We  can  show  that  the  Hessian  matrix  (d2  £(w(z)\V(0,T))  /  dzidzj)  is  negative  semi- 
definite.  Hence,  by  solving  the  equations  d£(w(z);  V(0,  T))/dZi  =  0,  (i  =  1,  •  •  • ,  K  —  1), 
we  can  find  the  value  of  z  that  maximizes  £(w(z);V(0,T)).  We  employed  a  standard 
Newton  Method  in  our  experiments. 
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Abstract.  We  analyzed  three  functions  of  Twitter  (Favorite,  Follow  and 
Mention)  from  network  structural  point  of  view.  These  three  functions 
are  characterized  by  difference  and  similarity  in  various  measures  de¬ 
fined  in  directed  graphs.  Favorite  function  can  be  viewed  by  three  differ¬ 
ent  graph  representations:  a  simple  graph,  a  multigraph  and  a  bipartite 
graph,  Follow  function  by  one  graph  representation:  a  simple  graph,  and 
Mention  function  by  two  graph  representations:  a  simple  graph  and  a 
multigraph.  We  created  these  graphs  from  three  real  world  twitter  data 
and  found  salient  features  characterizing  these  functions.  Major  findings 
are  a  very  large  connected  component  for  Favorite  and  Follow  functions, 
scale-free  property  in  degree  distribution  and  predominant  mutual  links 
in  certain  network  motifs  for  all  three  functions,  freaks  in  Gini  coefficient 
and  two  clusters  of  popular  users  for  Favorites  function,  and  a  structure 
difference  in  high  degree  nodes  between  Favorite  and  Mention  functions 
characterizing  that  Favorite  operation  is  much  easier  than  Mention  op¬ 
eration.  These  finding  will  be  useful  in  building  a  preference  model  of 
Twitter  users. 


1  Introduction 

Grasping  and  controlling  preference,  tendency,  or  trend  of  the  consuming  public 
is  one  of  the  important  factors  to  achieve  economical  success.  Accordingly,  it  is 
vital  to  collect  relevant  data,  analyze  them  and  model  user  preference.  However, 
quantifying  preference  is  very  difficult  to  achieve  and  finding  useful  measures 
from  the  network  structure  is  crucial.  The  final  goal  of  this  work  is  to  find  such 
measures,  characterize  their  relations  and  build  a  reliable  user  preference  model 
based  on  these  measures  from  the  available  data.  As  the  very  first  step,  we  focus 
on  Twitter  data  and  analyzes  the  user  behavior  of  three  functions  (Favorite, 
Follow  and  Mention)  of  Twitter  1  from  the  network  structural  point  of  view,  i.e. , 
by  using  various  measures  that  have  been  known  useful  in  the  graph  theory  and 

1  http://twitter.com/ 
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identifying  characteristic  features  (difference  and  similarity)  of  these  measures 
for  these  functions. 

User  behavior  of  these  three  functions  are  represented  by  different  directed 
graphs.  Favorite  function  can  be  viewed  by  three  different  graph  representations: 
a  simple  graph,  i.e.,  single  edge  from  a  Favorer  to  a  Favoree,  a  multigraph,  i.e. , 
multiple  edges  from  a  Favorer  to  a  Favoree,  and  a  bipartite  graph,  i.e.,  single  edge 
from  a  Favorer  to  a  Favoree  treating  a  user  with  both  a  Favorer  and  a  Favoree  as 
two  separate  nodes.  Likewise,  Follow  function  can  be  viewed  by  one  graph  rep¬ 
resentation:  a  simple  graph,  i.e.,  single  edge  from  a  Follower  and  a  Followee,  and 
Mention  function  can  be  viewed  by  two  different  graphs:  a  simple  graph,  i.e.  sin¬ 
gle  edge  from  a  Mentioner  (sender)  to  a  Mentionee  (receiver)  and  a  multigraph, 
i.e.  multiple  edges  from  a  Mentioner  to  a  Mentionee.  We  have  created  these  net¬ 
works  from  three  different  Twitter  logs  (called  ’(.Favorites  network”,  ’(Followers 
network”,  and  (Mentions  network”)  and  used  several  different  measures,  e.g. 
in-degree,  out-degree,  multiplicity,  Gini  coefficient,  etc.  Extensive  experiments 
were  performed  and  several  salient  features  were  found.  Major  findings  are  that 
1)  Favorites  and  Followers  networks  have  a  very  large  connected  component  but 
Mentions  network  is  not,  2)  all  the  three  networks  (both  simple  and  multiple) 
have  the  scale-free  property  in  degree  distribution,  3)  all  three  networks  (simple) 
have  predominant  three-node  motifs  having  mutual  links,  4)  Favorites  network 
have  freaks  in  Gini  coefficient  (one  of  the  measures),  5)  Favorites  network  have 
two  clusters  of  popular  users,  and  6)  Favorites  and  Mentions  networks  differ  in 
structure  for  high  degree  nodes  reflecting  that  Favorite  operation  is  much  easier 
than  Mentions  operation.  We  analyse  simple  graph  and  bipartite  graph  with 
conventional  methods,  and  multigraph  with  new  proposal  method  using  Gini 
coefficient,  the  index  which  measures  the  inequality  among  values  of  a  frequency 
distribution.  Twitter,  a  microblogging  service,  has  attracted  a  great  deal  of  at¬ 
tention  and  various  properties  have  already  been  obtained  [3]  [4],  but  to  our 
knowledge,  there  have  been  no  work  to  analyze  the  user  behavior  from  network 
structural  point  of  view.  We  believe  that  the  work  along  this  line  will  be  useful 
in  understanding  the  user  behavior  and  helps  building  a  preference  model  of 
Twitter  users. 

The  paper  is  organized  as  follows.  We  briefly  explain  the  various  measures 
we  adopted  in  our  analysis  in  2,  three  networks  (  Favorite,  Follow,  and  Mention) 
in  3.  Then  we  report  the  experimental  results  in  4  and  provide  some  discussions 
regarding  our  observations  in  5.  We  end  this  paper  by  summarizing  the  major 
finding  and  mentioning  the  future  work  in  6. 


2  Analysis  Methods 

According  to  [1],  we  define  the  structure  of  a  network  as  a  graph.  A  graph 
G  =  ( V ,  E )  consists  of  a  set  V  of  nodes  (vertices)  and  a  set  E  of  links  (edges) 
that  connect  pairs  of  nodes.  Note  that  in  our  Favorites,  Followers  or  Mentions 
network,  a  node  corresponds  to  a  Twitter  user,  and  a  link  corresponds  to  favor¬ 
ing,  following,  or  mentioning  between  a  pair  of  users.  If  two  nodes  are  connected 
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by  a  link,  they  are  adjacent  and  we  call  them  neighbors.  In  directed  graphs,  each 
directed  link  has  an  origin  (source)  and  a  destination  (target) .  A  link  with  origin 
u  £  V  and  destination  v  €  V  is  represented  by  an  ordered  pair  (u,v).  A  directed 
graph  G  =  (V,  E)  is  called  a  bipartite  graph,  if  V  is  divided  into  to  two  parts, 
Vx  and  Vy,  where  V  =  Vx  U  Vy,  Vx  (~l  Vy  =  0,  and  E  c  {(u,v)',u  £  Vx,v  £  Vy. 
In  directed  graphs,  we  may  allow  the  link  set  E  to  contain  the  same  link  several 
times,  i.e.,  E  can  be  a  multiset.  If  a  link  occurs  several  times  in  E ,  the  copies 
of  that  link  are  called  parallel  links.  Graphs  with  parallel  links  are  also  called 
multigraphs.  A  graph  is  called  simple,  if  each  of  its  links  is  contained  in  E  only 
once,  i.e.,  if  the  graph  does  not  have  parallel  links.  In  what  follows,  we  describe 
our  analysis  methods  for  each  type  of  graphs. 


2.1  Methods  for  Simple  Graph 


A  graph  G'  =  (V7,  E')  is  a  subgraph  of  the  graph  G  =  ( V ,  E)  if  V'  £  V  and  E'  £ 
E.  It  is  an  induced  subgraph  if  E'  contains  all  links  e  £  E  that  connect  nodes 
in  V' .  A  directed  graph  G  =  (V,E)  is  strongly  connected  if  there  is  a  directed 
path  from  every  node  to  every  other  node.  A  strongly  connected  component  of  a 
directed  graph  G  is  an  induced  subgraph  that  is  strongly  connected  and  maximal. 
A  bidirected  graph  G  =  ( V ,  E)  is  constructed  from  a  directed  graph  G  =  (V,  E) 
by  adding  counterparts  of  the  unidirected  links,  i.e.,  E  =  EU{(v,  u);  ( u ,  v)  £  E}. 
A  weakly  connected  component  of  a  directed  graph  G  is  an  induced  subgraph 
from  V'  obtained  as  a  strongly  connected  component  of  the  bidirected  graph 
G.  We  analyze  the  structure  of  our  networks  in  terms  of  the  connectivity  using 
these  notions. 

In  a  directed  graph  G  =  (V,  E),  the  out-degree  of  v  £  V,  denoted  by  d+(v), 
is  the  number  of  links  in  E  that  have  origin  v.  The  in-degree  of  v  £  V,  denoted 
by  dr  { v),  is  the  number  of  links  with  destination  v.  The  average  degree  d  is 
calculated  by 


d  = 


1 
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d  (v) 

vev 


1 

M 


d+ (v) 

vev 


M 


a) 


Here  |  •  |  stands  for  the  number  of  elements  for  a  given  set.  The  correlation 
between  in-  and  out-degree,  denoted  by  c,  is  calculated  by 


y(d  (v)  ~  d)(d+ (y)  -  d) 
VT,vev(d~(v)  -  d)2VT,vev(d+(v)  -  d )2 


(2) 


On  the  other  hand,  the  in-degree  distribution  id(k)  and  the  out-degree  distribu¬ 
tion  od(k)  with  respect  to  degree  k  are  respectively  defined  by 


id(k)  =  | {v  £V;d  (v)  =  k} |,  od(k)  =  |{v  e  V\d+(v)  =  k}\.  (3) 


We  analyze  the  statistical  properties  of  these  degree  distributions. 

Network  motifs  are  defined  as  patterns  of  interconnections  occurring  in  graphs 
at  numbers  that  are  significantly  higher  than  those  in  randomized  graphs.  In  our 
analysis,  we  focus  on  three-node  motifs  patterns  and  Figure  1  shows  all  thirteen 
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Fig.  1:  Network  motifs  patterns 


types  of  three-node  connected  subgraphs  (motifs  patterns) .  According  to  [5] ,  we 
also  use  randomized  graphs,  each  node  of  which  has  the  same  in-degree  and 
out-degree  as  the  corresponding  node  has  in  the  real  network  [6] .  A  significance 
level  of  each  motifs  pattern  i  is  evaluated  by  its  z-score  Zi ,  i.e. , 


Zi  = 


fi  J  1^2j=l9j,i 


J  1  ELCft  -  J  1  E'Li  9j,i)2 


(4) 


where  J  is  the  number  of  randomized  graphs  used  for  evaluation,  and  fi  and 
denote  the  numbers  of  occurrences  of  motifs  pattern  i  in  the  real  graph  and 
the  j-th  randomized  graph,  respectively.  By  this  motifs  analysis,  we  attempt  to 
uncover  the  basic  building  blocks  of  our  networks. 


2.2  Visualization  of  Bipartite  Graph 

We  can  construct  a  bipartite  graph  from  a  directed  graph  by  setting  Vx  = 
{u-,(u,v)  £  E}  and  Vy  =  {v;(u,v)  £  E},  and  regarding  that  any  element  in 
Vx  is  different  from  any  element  in  Vy.  Further,  according  to  [2],  we  describe  a 
bipartite  graph  visualization  method  for  our  analysis.  For  the  sake  of  technical 
convenience,  each  set  of  the  nodes,  Vx  and  Vy,  is  identified  by  two  different  series 
of  positive  integers,  i.e.,  Vx  =  {1,  •  •  • ,  m,  ■  ■  ■ ,  M}  and  Vy  =  {1,  ■  •  • ,  n,  ■  ■  ■ ,  TV}. 
Here  M  and  TV  are  the  numbers  of  the  nodes  in  Vx  and  Vy  ,  i.e.,  \VX\  =  M  and 
\Vy\  =  V,  respectively.  Then,  the  M  x  N  adjacency  matrix  A  =  {am^n}  is  de¬ 
fined  by  setting  am,n  =  1  if  (to,  n)  £  E;  am,n  =  0  otherwise.  The  L-dimensional 
embedding  position  vectors  are  denoted  by  xm  for  the  node  m  £  Vx  and  y„  for 
the  node  n  £  Vy.  Then  we  can  construct  M  x  L  and  TV  x  L  matrices  consisting  of 
these  position  vectors,  i.e.,  X  =  (xi,  •  •  •  xu)T  and  Y  =  (yi,  •  •  •  yn)T ■  Here  XT 
stands  for  the  transposition  of  X.  Hereafter,  we  assume  that  nodes  in  subset  Vx 
are  located  on  the  inner  circle  with  radius  rx  =  1 ,  while  nodes  in  Vy  are  located 
on  the  outer  circle  with  radius  ry  =  2.  Note  that  ||xm||  =  1,  ||y„||  =  2. 

The  centering  (Young-Householder  transformation)  matrices  are  defined  as 
H M  =  I m  ~  jImIm,  Hat  =  IN  -  jflwljf  where  IM  and  Ijv  stands  for 
M  x  M  and  TV  x  TV  identity  matrices,  respectively,  and  1  m  and  l^r  are  M- 
and  TV-dimensional  vectors  whose  elements  are  all  one.  By  using  the  double- 
centered  matrix  B  =  {&m,n}  that  is  calculated  from  the  adjacency  matrix  A  as 
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B  =  HmAHjv,  we  can  consider  the  following  objective  function  with  respect  to 
the  position  vectors  X  =  (xl5  •  •  • ,  xM)T  and  Y  =  (yi,  •  •  • ,  yjv)T. 


M  N 


S(X, 


Xm  y  n 


1 


M 


+  2 

m=l 


R*  -  x 


T 

rn^rn 


1  N 

)  +  2  Vn(rl  -  y. 

n—1 


2  "TV  1 
nJ  n) 


(5) 


where  {Am  |  to  =  1,  •  •  • ,  M}  and  {/ in  \  n  =  1,  •  •  • ,  N}  correspond  to  Lagrange 
multipliers  for  the  spherical  constraints,  i.e.,  x^xm  =  r\  and  y^y „  =  for 
1  <  m  <  M  and  1  <  n  <  N.  By  maximizing  S(X.,  Y)  defined  in  Equation  (5), 
we  can  obtain  our  visualization  results,  X  and  Y  for  a  given  bipartite  graph. 


2.3  Methods  for  Multigraph 

For  multigraphs,  we  denotes  the  number  of  links  from  node  u  to  v,  i.e.,  (u,v), 
as  mU'V.  Note  that  favoring  or  mentioning  between  a  pair  of  users  may  occur 
several  times  during  the  observed  period.  We  also  denote  an  in-neighbor  node 
set  of  node  v  by  A(v)  =  {u;mUtV  yf  0},  and  an  out-neighbor  node  set  of  node  v 
by  B(v)  =  {w;mVyW  y^  0}.  Then  we  can  consider  a  node  set  C(k)  =  {i>;  |A(u)|  = 
k}  for  which  the  number  of  in-neighbor  nodes  is  k,  and  a  node  set  D(k)  = 
{v;  |B(u)|  =  k}  for  which  the  number  of  out-neighbor  nodes  is  k.  Thus,  by  using 
these  notations,  with  respect  to  the  number  of  neighbors  k,  we  can  define  the 
in-neighbor  distribution  id(k)  and  the  out-neighbor  distribution  od(k)  as  follows: 

in(k )  =  |C(/c)|,  on(k)  =  \D(k)\.  (6) 


Note  that  in  case  of  simple  directed  graphs,  the  in-  and  out-neighbor  distributions 
are  simply  called  the  in-  and  out-degree  distributions,  respectively. 

Now,  we  define  a  set  of  nodes  whose  in-degree  are  not  zero  by  V~  =  {u  G 
V:deq~(v)  >  0},  and  a  set  of  nodes  whose  out-degree  are  not  zero  by  V+  = 
{v  G  V\deg+{v)  >  0}. 

Then,  we  can  define  the  average  in-multiplicity  m~(v)  for  v  G  V~  and  the 
average  out-multiplicity  m+{v)  for  v  G  V+  as  follow: 


to  (v) 


1 


y  \  mu,vi 

uGA(v) 


m+(v) 


1 


y  \  to.^j  w- 

w€iB(v) 


(7) 


For  a  multigraph,  we  can  define  the  average  in-multiplicity  m  and  the  average 
out-multiplicity  m+  as  follow: 

TO"  =  R—  Y  m+  =  Y  m+(v)-  (8) 

'  '  vev-  '  '  vGV+ 


On  the  other  hand,  with  respect  to  number  of  neighbors  k{>  1),  we  can  define 
the  average  link  multiplicity  im(k)  for  a  node  set  C(k),  and  the  average  link 
multiplicity  om{k )  for  a  node  set  D{k)  as  follows: 


im(k) 


1 


'Y  m+(v). 

v£D(k) 


(9) 


\D(k)\ 
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Similarly,  for  each  node  v  £  V,  we  can  define  the  in-Gini  coefficient  g  (v)  for 
v  £  V~  and  the  out-Gini  coefficient  g+(v )  for  v  £  V+  as  follow: 


9  {v) 


E(m,x)€A(d)xA( v)  I mu,v  mx,x| 

2{\A(v)\  -  l)E„e A(v)mu,v 


9+{v) 


E(w,x)eB(»)xB(»)  \mv,w  TOx,x| 
2(|-B(t;)|  —  1)  Euig-B(d)  mv,w 


(10) 

For  a  multigraph,  we  can  define  the  average  in-multiplicity  m  and  the  average 
out-multiplicity  ?n+  as  follow: 
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rR 


9  (v), 

vev- 


9+  = 
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W+\ 


9+(v)- 

v€V+ 


(ii) 


With  respect  to  number  of  neighbors  k(>  1),  we  can  define  the  average  Gini 
coefficient  ig(k)  for  a  node  set  C(k),  and  the  average  Gini  coefficient  og{k)  for  a 
node  set  D(k)  as  follows: 

(")’  (12) 


Here  note  that  the  gini  coefficient  has  been  widely  used  for  evaluating  inequality 
in  a  market  [7].  We  use  this  index  to  evaluate  inequality  between  favoring  and 
mentioning. 


3  Summary  of  Data 

We  briefly  explain  the  data  we  used  in  our  analysis.  These  data  are  retrieved 
from  Favorite,  Follow,  and  Mention  of  Twitter. 

’’Favorites”  is  a  function  which  enables  users  to  bookmark  tweets,  or  to 
browse  them  anytime.  We  constructed  a  network  with  the  users  as  nodes,  and 
the  Favorer/Favoree  relations  as  links.  These  data  are  retrieved  from  Favotter’s 
’’Today’s  best.”  2  during  the  period  from  May  1st  2011  to  February  12th  2012. 
Because  of  Favotter’s  specification,  the  retrieved  tweets  are  bookmarked  by  more 
than  or  equal  to  5  users.  This  directed  network  has  189,717  nodes,  7,077,070  sim¬ 
ple  links,  and  33,456,690  multiple  links3. 

’’Follow”  is  the  most  basic  function  of  Twitter.  Users  can  get  the  new  tweets 
posted  by  persons  they  are  interested  in  by  specifying  whom  to  follow.  We  con¬ 
structed  a  network  with  users  who  posted  more  than  or  equal  to  200  tweets 
as  nodes,  and  the  follower/followee  [3]  relations  as  links.  These  data  are  re¬ 
trieved  from  Twitter  search  4  as  of  January  31st  2011.  This  directed  network 
has  1,088,040  nodes  and  157,371,628  simple  links.  Follow  network  does  not  have 
multiple  links  because  users  specify  their  respective  followers  only  once. 

2  http://favotter.net/ 

3  The  number  of  simple  links  means  that  we  count  the  multiple  links  between  a  pair 
of  nodes  as  a  single  link. 

4  http://yats-data.com/yats/ 
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’’Mentions”  are  tweets  which  has  the  user’s  names  of  the  form  ’’©Screen  name” 
in  the  text.  We  constructed  a  network  with  users  as  nodes,  and  send/receive  re¬ 
lations  as  links.  These  data  are  retrieved  from  Toriumi’s  data  [8]  for  the  period 
from  March  7th  2011  to  March  23rd  2011.  This  directed  network  has  4,565,085 
nodes,  58,514,337  simple  links  and  193,913,339  multiple  links. 

Statistics  of  these  networks  are  described  for  Tables  1  and  2.  Here,  WCC1  in 
Table  1  means  the  maximal  weakly  connected  components,  Em  in  table  2  means 
the  number  of  multiple  links.  Others  are  defined  in  section  2. 


Table  1:  Statistics  of  simple  directed  networks 


\v\ 

\E\ 

\V\wcci  (\V\wcci/\V\)  d  c 

Favorites 

Follow 

Mentions 

189,717  7,077,070 

1,088,040  157,371,628 
4,565,085  58,514,337 

189,626  (99.9%)  37.3  0.2109 

1,079,986  (99.3%)  144.6  0.7354 

1,839,189  (40.3%)  3.2  0.0387 

Table  2: 

statistics  of  multi  directed  networks 

|  |V| 

\Em\ 

d  m  m+  g  g+ 

Favorites 

Mentions 

189,717  33,456,690 
4,565,085  193,913,339 

176.3505  2.1211  1.5024  0.2054  0.0851 

38.2894  3.6977  3.6574  0.3985  0.2138 

Table  1  shows  that  Mentions  network  has  a  smaller  WCC1  fraction  than 
the  other  two  networks.  This  is  understandable  in  view  of  the  communication 
aspect  of  Mentions  because  users  do  not  send  ©-messages  to  people  whom  they 
do  not  well.  Table  2  shows  that  Favorites  network  has  smaller  m~ ,  m+,  g~ , 
and  g+  (see  equations  8  and  11)  than  Mentions.  This  is  understandable  because 
only  a  few  users  are  heavy  favorers  and  the  majorities  have  much  less  favorees 
whereas  in  Mentions  the  distribution  of  the  number  of  mentions  of  each  user  is 
less  distorted,  which  makes  the  average  degree  of  Mentions  network  larger  than 
that  of  Favorites  network. 


4  Results 


In  this  section,  we  report  the  results  of  analysis  using  various  measures  explained 
in  2. 
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4.1  Simple  Directed  Graph 

As  seen  from  Table  1,  Favorites  and  Follow  networks  have  each  a  large  weakly 
connected  component  which  includes  almost  all  nodes  but  Mentions  network  is 
not  so.  Since  Mentions  network  is  too  large  to  analyze  for  all  nodes,  we  use 
WCC1  in  the  following  analysis  for  Mentions  network. 

Degree  Distribution  Figures  2,  3,  4,  5,  6,  and  7  are  the  results  of  degree 
distribution  of  the  three  networks.  Blue  and  red  diamond  marks  indicate  id  and 
od  (see  equation  (3)),  respectively.  The  vertical  axis  indicates  the  number  of 
nodes  in  logarithmic  scale.  From  these  pictures,  we  see  that  all  the  networks  can 
be  said  to  have  a  scale-free  property  for  both  in-degree  or  out-degree. 


Network  Motif  Figures  8  and  9  are  the  results  of  network  motif  analysis. 
The  horizontal  axis  indicates  the  motif  number  explained  in  4.  In  Figure  8  the 
vertical  axis  indicates  the  frequency  of  appearance  in  logarithmic  scale,  and  in 
Figure  9  the  vertical  axis  indicates  2-score  (see  equation  (4))  in  logarithmic  scale. 
Magenta  and  cyan  bars  mean  positive  score  and  negative  score  respectively.  From 
these  figures,  we  see  that  there  are  three  predominant  motifs:  patterns  13,  12, 
and  8,  which  are  all  characterized  by  having  mutual  links,  The  results  of  Follow 
and  Mentions  networks  are  similar  to  these  figures,  so  we  omit  showing  these 
results. 

4.2  Visualization  of  Bipartite  Graph 

Figure  10  is  the  result  of  visualization  of  bipartite  graph  of  Favorites.  In  this 
analysis  we  used  the  data  retrieved  from  only  July  1st  to  7th  2011  because  so 
many  links  obscure  the  graph.  Nodes  on  the  outer  circle  are  Favorers,  and  nodes 
on  the  inner  circle  are  Favorees.  Blue  and  Red  nodes  are  users  who  are  ranked 
Favorer/Favoree’s  top  10.  Only  links  with  more  than  or  equal  to  10  multiplicity 
are  shown  by  gray  lines. 

NHK_PR  is  the  official  account  of  NHK’s  PR  section5,  and  sasakitoshinao  is 
the  account  of  freelance  journalist.  His  tweets  are  on  serious  and  important 
topics,  for  instance,  current  news  or  opinions  about  it.  On  the  other  hand, 
kaiten_keiku  and  Satomii_Opera  are  regular  users  of  Twitter,  and  their  tweets 
are  often  negative  and/or  ’’geeky”. 

From  this  figure,  we  see  there  are  two  clusters  of  popular  users  which  are 
characterized  by  their  content  of  tweets,  one  with  serious  and  important  tweets 
and  the  other  with  negative  and/or  geeky  tweets. 

4.3  Multiple  Directed  Graph 

In  this  subsection,  we  show  the  results  of  analysis  using  the  measures  explained 
in  2.3.  In  all  the  figures  below  (Figures  11  to  22),  plots  in  blue  squares  are  for 

5  Japan  Broadcasting  Corporation 
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Fig.  2:  Favorites  network  in-degree  Fig.  3:  Favorites  network  out-degree 


Fig.  4:  Follow  network  in-degree 


indegree  distribution 


Fig.  5:  Follow  network  out-degree 


Fig.  6:  Mentions  network  in-degree  Fig.  7:  Mentions  network  out-degree 
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Fig.  8:  Favorites  network  motif  (fre¬ 
quency) 


Fig.  9:  Favorites  network  motif  (z-score) 


Only  links  with  more  than  or  equal  to 
10  multiplicity  are  shown 


Fig.  10:  Bipartite  Graph  Visualization 
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in-degree,  plots  in  red  squares  are  for  out-degree  and  plots  in  green  circles  are 
for  randomized  networks.  Horizontal  axes  are  all  in  logarithmic  scale. 


Degree  Distribution  Figures  11,  12,  13  and  14  are  the  results  of  degree  dis¬ 
tribution  (see  equation  (6))  for  Favorites  and  Mentions  networks.  The  vertical 
axes  are  frequency  (the  number  of  nodes)  in  logarithmic  scale. 


Fig.  11:  Favorites  in-degree  Fig.  12:  Favorites  out-degree 


Fig.  13:  Mentions  in-degree 


Fig.  14:  Mentions  out-degree 


From  these  figures,  we  see  that  both  networks  have  a  scale-free  property, 
same  as  the  simple  directed  networks  4.1.  We  notice  that  the  distributions  for 
the  randomized  Mentions  network  are  shifted  right  to  the  real  Mentions  network, 
but  this  is  not  so  for  Favorites  network. 


Average  Multiplicity  Figures  15,  16,  17  and  18  are  the  average  multiplicity 
(see  equation  (7))  for  the  both  networks.  The  vertical  axes  are  in  logarithmic 
scale. 

We  notice  the  difference  in  correlation  between  the  two  networks.  On  the 
average,  there  are  positive  correlations  between  the  average  multiplicity  and  the 
degree  for  Favorites  network  (Figures  15  and  16),  but  the  correlations  change 
from  positive  to  negative  as  the  degree  increases  for  Mentions  network  (Figures 
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Fig.  15:  Favorites  in-degree 


Fig.  16:  Favorites  out-degree 


Fig.  17:  Mentions  in-degree 


Fig.  18:  Mentions  out-degree 


17  and  18).  Furthermore,  the  average  multiplicity  of  randomized  Favorites  net¬ 
work  behaves  similarly  to  the  real  Favorites  network,  but  that  of  randomized 
Mentions  network  is  almost  flat  across  all  the  range  of  degree. 


Gini  coefficient  Figures  19,  20,  21  and  22  are  the  results  of  Gini  coefficient 
(see  equation  (10)  for  the  both  networks.  The  vertical  axes  are  in  linear  scale. 

Correlations  between  the  Gini  coefficient  and  the  degree  and  the  relation 
between  the  real  and  the  randomized  networks  are  similar  to  those  for  the  average 
multiplicity,  i.e. ,  positive  correlations  for  Favorites  network  (  Figures  19  and 
20),  positive  to  negative  correlations  for  Mentions  network  (Figures  21  and  22) 
and  more  positive  correlations  for  the  randomized  Favorites  network  than  the 
randomized  Mentions  network. 

5  Discussion 

The  results  in  subsections  4.1  and  4.3  revealed  that  all  the  three  networks  have 
the  scale-free  property,  but  we  notice  that  the  variance  in  the  degree  distribu¬ 
tions  for  Mentions  network  is  smaller  in  high  out-degree  nodes  than  others.  We 
conjecture  that  this  is  due  to  the  communication  aspect  of  Mention  function,  i.e. 
users  do  not  send  many  @-messages  to  people  they  do  not  know  well  and,  thus, 
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Fig.  19:  Favorites  in-degree 


Fig.  20:  Favorites  out-degree 


Fig.  21:  Mentions  in-degree 


Fig.  22:  Mentions  out-degree 


there  are  probably  no  big  hub  nodes  in  Mentions  network.  Further,  this  also 
explains  that  the  fraction  of  the  maximal  weakly  connected  component  (defined 
in  subsection  3)  is  smaller  than  the  other  networks. 

The  results  in  subsection  4.1  revealed  that  there  are  a  few  numbers  of  pre¬ 
dominant  motifs  that  are  characteristic  of  having  mutual  links.  This  accounts 
for  the  fact  that,  taking  Favorites  as  example,  mutual  links  are  easily  created 
between  users  who  have  similar  tastes  because  Favorites  network  is  driven  by 
preference. 

The  results  in  subsection  4.2  that  there  are  two  clusters  of  popular  users 
each  corresponding  to  a  particular  type  of  tweets  are  quite  natural  and  under¬ 
standable.  Whether  these  two  are  the  unique  tweets  and  there  are  no  other  such 
tweets  remains  to  be  explored. 

The  results  in  subsection  4.3  indicate  that  there  are  substantial  difference 
in  the  distributions  of  multiplicity  and  Gini  coefficient  for  high  degree  nodes 
between  Favorites  and  Mentions  networks.  This  is  explainable  considering  the 
difference  in  nature  of  the  two  functions,  Mentions  network  is  driven  by  commu¬ 
nications  between  users.  Sending/receiving  of  @-message  to/from  many  people 
become  less  practical,  thus  less  frequent  for  high  degree  nodes.  Favorites  network 
is  driven  by  preference.  Expressing  preference  (bookmarking  Favorees’  tweets) 
is  much  easier  than  sending/receiving  message,  thus  relatively  more  frequent  for 
high  degree  nodes. 
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The  results  in  subsection  4.3  revealed  that  there  are  positive  correlations  be¬ 
tween  the  Gini  coefficient  and  the  degree  for  all  the  range  of  degree  for  Favorites 
network,  but  not  so  for  Mentions  network.  This  may  suggest  that  Favorers  in 
high  out-degree  tends  to  preferentially  bookmark  specific  Favorees’  tweets,  and 
vice  versa  for  Favorees  in  high  in-degree. 

6  Conclusion 

With  the  final  goal  of  constructing  a  new  user  preference  model  in  daily  activi¬ 
ties  in  mind,  we  analyzed,  from  the  network  structure  perspective,  the  similarity 
and  difference  in  the  user  behavior  of  the  three  functions  of  Twitter:  Favorite, 
Follow  and  Mention.  User  behavior  is  embedded  in  the  logs  that  users  carried 
out  these  functions,  which  are  represented  by  directed  graphs.  Favorite  func¬ 
tion  was  analyzed  using  three  different  graph  representations:  a  simple  graph,  a 
multigraph  and  a  bipartite  graph,  Follow  function  by  one  graph  representation: 
a  simple  graph,  and  Mention  function  by  two  graph  representations:  a  simple 
graph  and  a  multigraph.  We  used  three  real  world  Twitter  logs  to  create  these 
directed  graphs  and  performed  various  kinds  of  analysis  using  several  represen¬ 
tative  measures  for  characterizing  structural  properties  of  graphs,  and  obtained 
several  salient  features. 

Major  findings  are  that  1)  Favorites  and  Followers  networks  have  a  very  large 
connected  component  but  Mentions  network  is  not,  2)  all  the  three  networks 
(both  simple  and  multiple)  have  the  scale-free  property  in  degree  distribution, 
3)  all  three  networks  (simple)  have  predominant  three-node  motifs  having  mutual 
links,  4)  Favorites  networks  have  freaks  in  Gini  coefficient  (one  of  the  measures), 
5)  Favorites  networks  have  two  clusters  of  popular  users,  and  6)  Favorites  and 
Mentions  networks  differ  in  structure  for  high  degree  nodes  in  case  of  multigraph 
representation  reflecting  that  Favorite  operation  is  much  easier  than  Mention 
operation  although  they  are  similar  in  case  of  simple  graph  representation. 

As  an  immediate  future  work,  we  plan  to  obtain  betweenness  centrality,  close¬ 
ness  centrality,  or  k-core  percolation  of  Favorites  network  represented  as  a  multi¬ 
graph  to  further  characterize  use  behavior  and  hopefully  to  extract  enough  reg¬ 
ularity  to  model  user  preference,  and  pursue  the  literature  review  and  usefulness 
of  the  model. 
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Abstract.  We  address  the  problem  of  extracting  the  groups  of  function¬ 
ally  similar  nodes  from  a  network.  As  functional  properties  of  nodes,  we 
focus  on  hierarchical  levels,  relative  locations  and/or  roles  with  respect 
to  the  other  nodes.  For  this  problem,  we  propose  a  novel  method  for 
extracting  functional  communities  from  a  given  network.  In  our  exper¬ 
iments  using  several  types  of  synthetic  and  real  networks,  we  evaluate 
the  characteristics  of  functional  communities  extracted  by  our  proposed 
method.  From  our  experimental  results,  we  confirmed  that  our  method 
can  extract  functional  communities,  each  of  which  consists  of  nodes  with 
functionally  similar  properties,  and  these  communities  are  substantially 
different  from  those  obtained  by  the  Newman  clustering  method. 


1  Introduction 

Finding  groups  of  functionally  similar  nodes  in  a  social  or  information  network 
can  be  a  quite  important  research  topic  in  various  fields  ranging  from  computer 
science  to  sociology.  Hereafter,  such  a  node  group  is  simply  referred  to  as  a 
functional  community.  In  fact,  each  node  which  typically  corresponds  to  a  person 
in  a  social  network  may  have  a  wide  variety  of  functional  properties  such  as 
status,  ranks,  roles,  and  so  forth,  as  described  in  [1],  However,  conventional 
methods  for  extracting  communities  as  densely  connected  subnetworks,  which 
include  the  Newman  clustering  method  based  on  a  modularity  measure  [2],  and 
normalized  cut  [3]  or  ratio  cut  [4]  method  based  on  the  spectral  graph  analysis, 
cannot  directly  deal  with  such  functional  properties.  Evidently,  conventional 
notions  of  densely  connected  subnetworks  such  as  k- core  [5],  &-dense  [6]  and 
fc-clique  [7]  cannot  work  for  this  purpose.  Namely,  it  is  naturally  anticipated 
that  these  existing  methods  have  an  intrinsic  limitation  for  extracting  functional 
communities. 

In  this  study,  as  typical  functional  properties  of  nodes,  we  especially  focus 
on  hierarchical  levels,  relative  locations  and/or  roles  with  respect  to  the  other 
nodes.  This  implies  that  there  exist  some  functionally  similar  nodes  even  if  they 
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are  not  directly  connected  with  each  other.  For  instance,  in  case  of  a  network 
of  employees  relationships  in  a  company,  we  can  naturally  assume  it  to  have  a 
hierarchical  property,  where  the  top  node  corresponds  to  the  president,  and  in 
turn,  the  successive  levels  of  nodes  correspond  to  managers,  section  leaders,  and 
so  on.  For  example,  our  objective  might  be  to  extract  a  group  of  section  leaders 
as  a  functional  community  in  the  network,  even  though  they  may  not  have  direct 
connections  with  each  other.  Similarly,  even  in  case  of  a  hyperlink  network  of 
Web  pages  in  a  site,  we  can  also  assume  it  to  have  a  hierarchical  property,  where 
the  top  node  corresponds  to  the  top  page  at  this  site  served  as  an  entrance,  and 
in  turn,  the  successive  levels  of  nodes  may  correspond  to  Web  pages  containing 
more  specific  topics.  Then  our  objective  might  be  to  extract  a  group  of  Web  pages 
with  the  same  level  of  topic  specificity.  Here  we  should  emphasize  that  extracting 
these  types  of  communities  can  be  a  quite  tough  problem  for  the  conventional 
community  extraction  methods  because  these  existing  methods  mainly  focus  on 
link  densities  among  each  subnetwork  and  between  subnetworks. 

In  this  paper,  we  propose  a  novel  method  for  extracting  functional  communi¬ 
ties  from  a  given  network.  This  algorithm  consists  of  two  steps:  the  method  first 
assigns  a  feature  vector  to  each  node,  which  is  assumed  to  be  some  functional 
properties,  by  using  calculation  steps  of  PageRank  scores  [8]  for  nodes  from 
an  initial  score  vector.  Then,  in  a  case  that  the  supposed  number  of  functional 
communities  is  K ,  the  method  divides  all  the  node  into  K  groups  by  using  the 
tv-medians  clustering  method  based  on  the  cosine  similarity  between  a  pair  of 
the  feature  vectors.  In  our  experiments  using  several  types  of  synthetic  and  real 
networks,  we  evaluate  the  characteristics  of  functional  communities  extracted  by 
our  proposed  method.  To  this  end,  we  utilize  the  visualization  result  of  each  net¬ 
work  where  each  functional  community  is  indicated  by  a  different  color  marker, 
and  these  results  are  contrasted  to  those  obtained  by  the  Newman  clustering 
method  [2]. 

This  paper  is  organized  as  follows:  after  explaining  two  component  algorithms 
in  Section  2,  we  describe  a  detail  of  our  proposed  method  in  Section  3.  Then, 
by  using  a  number  of  visualized  networks,  in  comparison  to  standard  communi¬ 
ties  extracted  by  the  Newman  clustering  method,  we  qualitatively  evaluate  the 
characteristics  of  the  extracted  functional  communities  in  Section  4.  Finally,  we 
describe  our  conclusion  in  Section  5. 

2  Component  Algorithms 

In  this  section,  for  the  sake  of  convenience,  we  explain  two  existing  methods, 
PageRank  [8]  and  tt-medians.  These  are  used  as  component  algorithms  for  our 
newly  proposing  method. 


2.1  PageRank  Revisited 

For  a  given  Web  hyperlink  network  (directed  graph),  we  identify  each  node 
with  a  unique  integer  from  1  to  \V\.  Then  we  can  define  the  adjacency  matrix 
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A  £  {0,  l}lylxlv  I  by  setting  a(u,v )  =  1  if  (u,v)  £  E;  otherwise  a(u,v)  =  0.  A 
node  can  be  self-looped,  in  which  case  a(u,u)  =  1.  For  each  node  v  £  V,  let 
F(v)  and  B{v)  denote  the  set  of  child  nodes  of  v  and  the  set  of  parent  nodes  of 
v,  respectively,  F(v)  =  {w  £V\  (v,w)  £  E},  B(v)  =  {u  £  V;  (u,  v)  £  E}.  Note 
that  v  £  F(v)  and  v  £  B(v)  for  node  v  with  a  self-loop. 

Then  we  can  consider  the  row-stochastic  transition  matrix  P,  each  element 
of  which  is  defined  by  p(u,v )  =  a(u,  v)/\F(u)\  if  |F(w)|  >  0;  otherwise  p{u,v)  = 
z(v),  where  z  is  some  probability  distribution  over  nodes,  i.e.,  z(v)  >  0  and 
Y^vqv  z(v )  =  1-  This  model  means  that  from  dangling  Web  pages  without  out- 
links  ( F(u )  =  0),  a  random  surfer  jumps  to  page  v  with  probability  z(v).  The 
vector  z  is  referred  to  as  a  personalized  vector  because  we  can  define  z  according 
to  user’s  preference. 

Let  y  denote  a  vector  representing  PageRank  scores  over  nodes,  where  y(v)  > 
0  and  v(v)  =  T  Then  using  an  iteration-step  parameter  s,  PageRank 

vector  y  is  defined  as  a  limiting  solution  of  the  following  iterative  process, 

VTs  =  VTs- 1  ((!  -  a)P  +  aezT )  =  (1  -  a)y^±P  +  azT ,  (1) 

where  aT  stands  for  a  transposed  vector  of  a  and  e  =  (1,  •  ■  *  ,  1)T.  In  the  Equa¬ 
tion  (1),  a  is  referred  to  as  the  uniform  jump  probability.  This  model  means 
that  with  the  probability  a,  a  random  surfer  also  jumps  to  some  page  according 
to  the  probability  distribution  z.  The  matrix  ((1  —  a)P  +  aezT)  is  referred  to 
as  a  Google  matrix.  The  standard  PageRank  method  calculates  its  solution  by 
directly  iterating  Equation  (1),  after  initializing  yo  adequately.  One  measure  to 
evaluate  its  convergence  is  defined  by 

\\ys  -  ys-i\\Li  =  Mu)  ~  y«-i(w)l-  (2) 

v£V 

Note  that  any  initial  vector  yo  can  give  almost  the  same  PageRank  scores  if  it 
makes  Equation  (2)  almost  zero  because  the  unique  solution  of  Equation  (1)  is 
guaranteed. 

2.2  iT-medians  Revisited 

For  a  given  set  of  objects  (or  nodes),  denoted  by  V  =  {v,w,  -  ■  ■  },  the  if -medians 
method  first  selects  K  representative  objects  TZ  C  V  according  to  the  following 
objective  function  to  be  maximized. 

f(K)  =  ^™g-,p(v,r).  (3) 

^ '  rG 7c 
vev 

Here  p(v,  r)  stands  for  a  similarity  measure  between  a  pair  of  objects,  v  and  r. 
Then,  from  the  obtained  K  representative  objects  7 Z  =  {r\,  •  •  •  ,  tk},  the  method 
determines  the  K  clusters,  {Ci,  •  •  •  ,  C /<- } ,  by  using  the  following  formula. 

Ck=  {v  £V\rk  =  argrnax, p(v,r)}. 


(4) 
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Finally,  the  method  outputs  {C i,  •  •  •  ,Ck}  as  the  result. 

In  order  to  maximize  Equation  (3)  with  respect  to  TZ,  due  to  simplicity  we 
employ  a  greedy  algorithm  shown  below. 

1.  Initialize  k  •<—  1  and  7Z  <—  0; 

2.  Select  =  argmaxwey\-R,{/(7?.)},  and  set  1Z  1Z  U  {t-*,}; 

3.  If  k  =  K,  output  1Z  —  {j*i,  •  •  •  ,1'k}  and  terminateQ 

4.  Set  k  <—  k  +  1  and  return  to  step  2; 

Here  note  that  in  virtue  of  the  submodularity  of  the  objective  function  defined 
in  Equation  (3) ,  we  can  obtain  a  unique  greedy  solution  whose  worst  case  quality 
is  guaranteed  [9]. 


3  Proposed  Method 


In  this  section,  we  describe  our  proposed  method  for  extracting  functional  com¬ 
munities.  Our  method  utilizes  the  PageRank  score  vectors  at  each  iteration  step 
s,  i.e.,  {y i,  •  •  •  ,  ys}-  Here,  S  stands  for  the  final  step  when  the  PageRank  algo¬ 
rithm  converges.  Then,  for  each  node  v  £  V,  we  can  consider  an  ^-dimensional 
vector  defined  by  xv  =  (yi(v),---  ,ys(v))T ,  where  ys{v)  means  the  PageRank 
score  of  node  v  at  iteration  step  s.  In  our  method,  xv  is  regarded  as  a  functional 
property  vector  of  node  v. 

Here  we  note  a  reason  why  we  employ  the  vector  described  above.  Basically, 
we  assume  that  functional  properties  of  nodes,  such  as  hierarchical  levels,  relative 
locations  and/or  roles  with  respect  to  the  other  nodes  are  embedded  into  the 
network  structure.  On  the  other  hand,  the  PageRank  scores  at  each  iteration 
step  also  reflect  the  network  structure.  Therefore,  as  an  approximation,  we  can 
consider  that  functional  properties  are  also  represented  by  the  vector  xv . 

In  order  to  divide  all  nodes  into  the  K  groups,  our  method  employs  the 
iv-medians  algorithm  described  in  the  previous  section.  To  this  end,  we  need  to 
define  an  adequate  similarity  p(u,  v )  between  the  nodes  u  and  v.  In  our  proposed 
method,  for  each  pair  of  functional  property  vectors,  we  employ  the  following 
cosine  similarity. 


p{u,v) 


'L‘  u  'Ey 

||*u||  M’ 


(5) 


where  ||ccu||  stands  for  the  standard  L2  norm. 

For  a  given  network  G  =  (V,  E)  and  the  number  K  of  functional  communities, 
we  summarize  our  proposed  algorithm  below. 


1.  Calculate  the  PageRank  score  vectors  at  each  time  step  {y±,  ■  ■  ■  ,  ys}\ 

2.  Construct  the  functional  property  vector  x„  for  each  node  v  £  V; 

3.  Calculate  the  cosine  similarity  p(u,  v)  of  xu  and  xv  for  all  node  pair; 

4.  Divide  all  nodes  into  K  clusters  according  to  the  similarity  p(u,v)  by  the 
itT-medians  method; 

5.  Output  functional  communities  {C\,  ■  ■  ■  ,Ck}', 
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4  Experimental  Evaluation 

In  this  section,  using  several  types  of  synthetic  and  real  networks,  we  experi¬ 
mentally  evaluate  the  characteristics  of  functional  communities  extracted  by  our 
proposed  method.  For  this  purpose,  we  utilize  the  visualization  result  of  each  net¬ 
work  where  each  functional  community  is  indicated  by  a  different  color  marker, 
and  these  results  are  contrasted  to  those  obtained  by  the  Newman  method  [2]. 

4.1  Network  Data 

We  describe  a  detail  of  four  networks  used  in  our  experiments. 

First  one  is  a  synthetic  network  with  a  hierarchical  property,  just  like  an  em¬ 
ployee  relationships  or  Web  hyperlinks  network.  In  this  hierarchical  network,  we 
can  assume  two  types  of  nodes,  central  (or  high  status)  and  peripheral  (or  low 
status)  nodes.  As  shown  later  in  Fig.  1,  in  terms  of  its  basic  network  statistics, 
the  central  nodes  are  characterized  by  relatively  high  degree  and  low  clustering 
coefficients,  while  the  peripheral  nodes  by  relatively  low  degree  and  high  clus¬ 
tering  coefficients.  We  generated  this  network  according  to  Ravasz  et.al.  [10]. 
Hereafter,  this  network  is  referred  to  as  Hierarchical  network. 

Second  one  is  a  two  dimensional-grid  network  implemented  as  a  set  of  10  x 
10  lattice  points.  Evidently,  as  shown  later  in  Fig.  2,  because  of  the  regular 
structure,  dividing  this  network  into  several  portions  does  not  make  sense  in  the 
aspects  of  standard  community  extraction.  Whereas,  we  can  consider  a  functional 
property  in  terms  of  relative  locations  to  other  nodes,  i.e. ,  the  relative  closeness 
to  the  center  position.  Hereafter,  this  network  is  referred  to  as  Lattice  network. 

Third  one  is  a  social  network  of  people  belonging  to  a  karate  circle,  which 
has  been  widely  used  as  a  benchmark  network.  As  shown  later  in  Fig.  3,  we  see 
a  number  of  hub  nodes,  which  play  an  important  role  to  connect  other  nodes. 
Namely,  we  can  assume  that  some  group  of  nodes  has  a  similar  role  with  respect 
to  the  other  nodes.  Hereafter,  this  network  is  referred  to  as  Karate  network  [11]. 

Forth  one  is  a  hyperlink  network  of  a  Japanese  university  Web  site,  where  we 
obtained  this  network  by  crawling  the  Web  site  as  of  Aug.  2010.  As  shown  later 
in  Fig.  4,  there  exist  a  number  of  unique  characteristics  in  this  network.  Namely, 
we  can  assume  that  some  group  of  Web  pages  has  a  similar  topic  specificity  level. 
Hereafter,  this  network  is  referred  to  as  Hosei  network  2. 

Table  1  shows  the  basic  statistics  of  the  Hierarchical,  Lattice,  Karate  and 
Hosei  networks.  Here,  C  and  L  denote  the  averages  of  clustering  coefficients  and 
shortest  path  lengths,  respectively. 


4.2  Experimental  Settings 

We  first  explain  the  settings  of  our  proposed  algorithm.  In  order  to  calculate  the 
PageRank  score  vectors,  we  set  the  initialized  vector  to  y0  =  (1/|  V| , . . . ,  l/|Vj)T, 

2  The  site  name  and  its  address  are  ’’Faculty  of  Computer  and  Information  Sciences, 
Hosei  University”  and  http://cis.k. hosei. ac.jp/  ,  respectively. 
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Table  1.  Basic  statistics  of  networks. 


network 

\v\ 

\E\ 

C 

L 

Hierarchical 

125 

410 

0.84 

2.13 

Lattice 

100 

180 

0.00 

4.59 

Karete 

34 

78 

0.57 

2.03 

Hosei 

600 

1299 

0.54 

4.22 

and  the  convergence  criterion  defined  in  Equation  2  is  implemented  as  \\ys  — 
Vs-i\\li  <  10-12.  The  number  K  of  communities  to  be  extracted  is  changed 
from  K  =  2  to  10. 

As  mentioned  earlier,  we  attempt  to  clarify  the  characteristics  of  the  func¬ 
tional  communities  extracted  by  our  method,  in  comparison  to  standard  commu¬ 
nities  extracted  by  the  Newman  clustering  method  [2],  Hereafter,  such  a  standard 
community  is  simply  referred  to  as  a  Newman  community.  The  Newman  method 
is  basically  designed  to  obtain  densely  connected  subnetworks  by  maximizing  a 
modularity  measure. 

Finally,  we  describe  methods  to  visualize  each  network.  In  Hierarchical  net¬ 
work,  we  employ  nodes’  positions  as  displayed  by  Ravasz  et.al.  [10].  As  for  Lattice 
network,  we  can  regularly  assign  the  positions  to  nodes.  In  cases  of  Karate  and 
Hosei  networks,  the  cross-entropy  embedding  method  [12]  is  used  to  determine 
the  positions  of  nodes. 

4.3  Experimental  results 

We  show  the  experimental  results  of  Hierarchical  network  at  K  =  5  in  Fig.  1. 
Here  note  that  this  network  consists  of  five  portions  of  densely  connected  sub¬ 
networks,  as  observed  in  Fig.  1.  Thus,  as  an  example,  we  selected  this  number, 
K  =  5.  As  expected,  from  Fig.  1(a),  we  see  that  our  method  could  extract  rea¬ 
sonable  functional  communities,  each  of  which  consists  of  nodes  with  the  similar 
hierarchical  levels,  just  like  employees  with  same  position  such  as  the  president, 
managers,  or  general  staffs.  On  the  other  hand,  from  Fig.  1(b),  we  see  that  the 
Newman  method  extracted  standard  communities,  each  of  which  is  characterized 
as  a  densely  connected  subnetwork,  just  like  employees  belonging  to  the  same 
department  or  section. 

We  show  the  experimental  results  of  Lattice  network  at  K  =  3  in  Fig.  2. 
Here  recall  that  in  the  aspects  of  standard  community  extraction,  dividing  this 
network  into  several  portions  does  not  make  sense.  Thus,  as  an  example,  we 
selected  this  relatively  small  number,  K  =  3.  From  Fig.  2(a),  we  see  that  our 
method  could  extract  reasonable  functional  communities,  each  of  which  consists 
of  nodes  with  the  similar  relative  locations,  i.e. ,  the  relative  closeness  to  the 
center  position.  On  the  other  hand,  as  shown  in  Fig.  2(b),  we  can  hardly  make 
sense  to  the  communities  extracted  by  the  Newman  method. 

We  show  the  experimental  results  of  Karate  network  at  K  =  2  in  Fig.  3. 
Here  note  that  this  network  consists  of  two  portions  of  densely  connected  sub- 
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networks,  as  observed  in  Fig.  3.  Thus,  as  an  example,  we  selected  this  number, 
K  =  2.  From  Fig.  3(a),  we  see  that  our  method  could  extract  reasonable  func¬ 
tional  communities,  each  of  which  consists  of  nodes  with  different  roles  with 
respect  to  the  other  nodes,  i.e.,  groups  of  hub  nodes  and  the  other  nodes.  On  the 
other  hand,  from  Fig.  3(b),  we  see  that  the  Newman  method  extracted  standard 
communities,  each  of  which  is  characterized  as  a  densely  connected  subnetwork. 

We  show  the  experimental  results  of  Hosei  network  at  K  =  10  in  Fig.  4.  Here 
note  that  this  network  consists  of  several  portions  of  characteristically  connected 
subnetworks,  as  observed  in  Fig.  4.  Thus,  as  an  example,  we  selected  this  rela¬ 
tively  large  number,  K  =  10.  From  Fig.  4(a),  we  see  that  our  method  extracted 
several  communities,  each  of  which  consists  of  nodes  with  similar  connection 
patterns.  In  order  to  more  closely  investigate  these  extracted  communities,  we 
focused  on  a  particular  community  indicated  by  small  blue  squares  surrounding 
with  large  transparent  squares  in  Fig.  4(a).  From  our  examination  of  these  Web 
pages  belonging  to  this  community,  we  realized  that  these  Web  pages  correspond 
to  annual  reports  of  each  year  produced  by  faculty  members.  Namely,  it  is  as¬ 
sumed  that  these  Web  pages  in  this  community  have  a  similar  topic  specificity 
level.  Thus,  we  can  consider  that  our  method  could  extract  a  piece  of  reasonable 
functional  communities  in  the  sense  described  above.  On  the  other  hand,  from 
Fig.  4(b),  we  see  that  the  Newman  method  divided  the  functional  community 
focused  above  into  several  communities. 

From  our  experimental  results  using  these  networks  with  different  character¬ 
istics,  we  confirmed  that  our  method  could  extract  functional  communities,  each 
of  which  consists  of  nodes  with  similar  functional  properties  such  as  hierarchi¬ 
cal  levels,  relative  locations  and/or  roles  with  respect  to  the  other  nodes.  These 
results  indicate  that  our  method  is  promising  for  tasks  of  extracting  functional 
communities  with  these  properties.  On  the  other  hand,  the  Newman  method  ex¬ 
tracted  standard  communities  characterized  by  densely  connected  subnetworks. 
From  these  results,  we  see  that  these  functional  communities  extracted  by  our 
method  are  substantially  different  from  those  obtained  by  the  Newman  method. 


5  Conclusion 

We  addressed  the  problem  of  extracting  the  groups  of  functionally  similar  nodes 
from  a  network.  In  this  paper,  such  a  node  group  was  simply  referred  to  as  a 
functional  community.  As  functional  properties  of  nodes,  we  focused  on  hierar¬ 
chical  levels,  relative  locations  and/or  roles  with  respect  to  the  other  nodes,  and 
proposed  a  novel  method  for  extracting  functional  communities  from  a  given 
network.  In  our  experiments  using  several  types  of  synthetic  and  real  networks, 
we  evaluated  the  characteristics  of  functional  communities  extracted  by  our  pro¬ 
posed  method.  From  our  experimental  results,  we  confirmed  that  our  method 
could  extract  functional  communities,  each  of  which  consists  of  nodes  with  func¬ 
tionally  similar  properties,  and  these  communities  were  substantially  different 
from  those  obtained  by  the  Newman  clustering  method.  In  future,  we  plan  to 
evaluate  our  method  using  various  networks. 
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(a)  Functional  community 


(b)  Newman  community 


Fig.  1.  Hierarchical  Network  ( K  =  5) 


(a)  Functional  community 


(b)  Newman  community 


Fig.  2.  Lattice  Network  (A'  =  3) 
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Fig.  3.  Karate  Network  ( I\  =  2) 


(a)  Functional  community 


(b)  Newman  community 


Fig.  4.  Hosei  Network  ( K  =  10) 
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Abstract.  Social  networks  play  an  important  role  for  spreading  information  and 
forming  opinions.  A  variety  of  voter  models  have  been  defined  that  help  analyze 
how  people  make  decisions  based  on  their  neighbors’  decisions.  In  these  stud¬ 
ies,  common  practice  has  been  to  use  the  latest  decisions  in  opinion  formation 
process.  However,  people  may  decide  their  opinions  by  taking  account  not  only 
of  their  neighbors’  latest  opinions,  but  also  of  their  neighbors’  past  opinions.  To 
incorporate  this  effect,  we  enhance  the  original  voter  model  and  define  the  tempo¬ 
ral  decay  voter  (TDV )  model  incorporating  a  temporary  decay  function  with  pa¬ 
rameters,  and  propose  an  efficient  method  of  learning  these  parameters  from  the 
observed  opinion  diffusion  data.  We  further  propose  an  efficient  method  of  select¬ 
ing  the  most  appropriate  decay  function  from  among  the  candidate  functions  each 
with  the  optimized  parameter  values.  We  adopt  three  functions  as  the  typical  can¬ 
didates:  the  exponential  decay,  the  power-law  decay,  and  no  decay,  and  evaluate 
the  proposed  method  (parameter  learning  and  model  selection)  through  extensive 
experiments.  We,  first,  experimentally  demonstrate,  by  using  synthetic  data,  the 
effectiveness  of  the  proposed  method,  and  then  we  analyze  the  real  opinion  dif¬ 
fusion  data  from  a  Japanese  word-of-mouth  communication  site  for  cosmetics 
using  three  decay  functions  above,  and  show  that  most  opinions  conform  to  the 
TDV  model  of  the  power-law  decay  function. 


1  Introduction 

Social  networking  services  (SNSs)  on  the  Internet,  such  as  Facebook,  Twitter  and  Digg, 
have  become  so  popular  and  use  of  these  services  is  now  a  part  of  our  daily  activi¬ 
ties.  Large  networks  formed  by  these  services  play  an  important  role  as  a  medium  for 
spreading  diverse  information  including  news,  ideas,  opinions,  and  rumors  [18, 17, 8, 
6],  Users  of  these  services  can  share  their  interests  or  opinions  to  each  other.  The  re¬ 
sulting  social  networks  and  the  information  propagated  therein  have  great  influence 
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on  and  drastically  change  our  decision  making  processes  and  behaviors  in  daily  life. 
Thus,  many  attempts  have  been  made  to  investigate  the  spread  of  influence  in  social 
networks  [15,5,21], 

One  such  typical  and  well  studied  problem  in  social  network  analysis  is  the  influence 
maximization  problem,  which  is  finding  a  limited  number  of  influential  nodes  that  are 
effective  for  spreading  information  [10, 11, 16,3,4],  What  is  common  to  these  studies 
is  that  models  used  allow  a  node  in  the  network  to  take  only  one  of  the  two  states, 
i.e.,  either  active  or  inactive,  because  the  focus  is  on  influence.  However,  we  need  a 
model  in  which  a  node  can  take  multiple  states  for  such  applications  in  which  a  user 
can  choose  one  from  multiple  choices.  For  example,  a  mobile  phone  user  may  change 
his/her  current  carrier  to  the  one  which  the  majority  of  his/her  neighbors  are  using. 
To  model  this  kind  of  opinion  formation  dynamics,  a  node  in  the  network  has  to  be 
able  to  take  one  of  many  possible  choices  as  its  state.  A  voter  model  would  be  the  one 
which  is  most  suitable  for  this  purpose.  It  is  one  of  the  most  basic  stochastic  process 
models,  where  a  node  decision  is  influenced  by  its  neighbors’  decisions  [20, 9, 7, 2, 22]. 
We  proposed  two  variants  of  voter  model  in  our  past  work:  the  value-weighted  voter 
model  that  considers  opinion  values  [12],  and  the  value-weighted  mixture  voter  model 
that,  in  addition  to  the  opinion  values,  considers  the  effect  of  anti-majoritarians,  i.e., 
those  people  who  do  not  agree  with  the  majority  and  support  the  minority  opinion  [13]. 

In  this  paper  we  also  address  the  problem  of  opinion  formation  on  the  social  net¬ 
work,  but  we  especially  focus  on  the  fact  that  our  decision  may  be  influenced  not  only  by 
our  neighbors’  and  our  own  latest  opinions,  but  also  by  the  neighbors’  and  our  own  past 
opinions.  For  example,  assume  that  you  and  your  friends  have  long  supported  a  certain 
political  party,  but  many  of  your  friends  have  started  changing  their  supporting  party  to 
a  different  one  very  recently.  Under  this  situation,  you  may  still  stick  to  your  opinion 
and  keep  supporting  the  party,  or  you  may  change  your  mind  and  follow  your  neigh¬ 
bors’  opinions.  This  means  that  your  current  opinions  are  influenced  not  only  by  the 
neighbors’  latest  opinions  but  also  by  their  past  opinions  including  your  own  opinions. 
It  is,  thus,  important  to  consider  all  the  past  opinions  in  making  the  current  decision. 
Nonetheless,  all  the  voter  models  including  the  two  variants  mentioned  above  consider 
only  the  latest  opinions  of  its  neighbors  including  itself  when  updating  the  opinion  of  a 
node. 

With  this  in  mind  we  enhance  the  original  voter  model  and  define  the  temporal 
decay  voter  (TDV)  model  that  takes  into  account  all  the  past  opinions  discounting  the 
effect  of  older  opinions  by  using  a  temporal  decay  function.  The  work  most  closely 
related  to  our  approach  would  be  the  work  by  Koren  [14]  which  is  in  the  context  of  rec- 
ommender  systems,  where  several  time  drifting  user  preference  models  are  proposed, 
some  of  which  adopt  a  temporal  decay  function  that  discounts  the  effect  of  older  ratings 
to  items.  The  approach  in  Koren’s  work  is,  unlike  our  approach,  cannot  utilize  all  the 
past  ratings  given  by  a  user  for  an  identical  item  because  the  user-item  matrix  that  they 
use  does  not  allow  multiple  ratings  to  be  stored.  In  addition,  due  to  the  framework  of 
collaborative  filtering,  it  requires  the  rating  history  involving  multiple  items,  while  our 
approach  can  model  the  temporal  dynamics  of  opinions  for  a  single  item.  Thus,  it  does 
not  make  sense  to  compare  Koren’s  approach  with  ours. 
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Our  major  contribution  is  the  following  four:  1)  the  TDV  model,  2)  an  algorithm 
of  learning  the  parameters  of  the  temporal  decay  function  from  the  observed  opinion 
spreading  data,  3)  a  model  selection  method  that  determines  the  most  appropriate  de¬ 
cay  function  for  given  data,  and  4)  new  finding  regarding  the  decay  model  from  the 
analysis  of  the  real  data.  The  model  parameters  are  learned  by  an  efficient  iterative  al¬ 
gorithm  which  maximizes  the  likelihood  function.  Three  representative  decay  functions 
are  employed,  although  the  framework  is  not  necessarily  limited  to  them:  the  exponen¬ 
tial  decay,  the  power-law  decay,  and  no  decay.  Which  function,  each  with  the  optimized 
parameter  values,  is  most  appropriate  for  given  data  is  determined  based  on  the  log 
likelihood  ratio  statistic.  We  evaluate  the  parameter  learning  and  the  model  selection 
methods  through  extensive  experiments  using  synthetic  data  with  two  TDV  models, 
one  with  the  exponential  decay  and  the  other  with  power-low  decay.  We  then  applied 
the  methods  to  the  real  opinion  spreading  data  from  a  Japanese  word-of-mouth  commu¬ 
nication  site  for  cosmetics  using  aforementioned  three  decay  functions,  and  show  that 
most  opinions  conform  to  the  TDV  model  of  the  power-law  decay  function. 

The  paper  is  organized  as  follows.  We  define  the  TDV  model  in  Section  2  and  ex¬ 
plain  how  the  model  parameters  are  learned  and  the  most  appropriate  model  is  selected 
in  Section  3.  The  performance  of  parameter  learning  and  model  selection  using  the 
synthetic  data  is  reported  in  Section  4  and  the  finding  from  the  analysis  of  real  data  is 
reported  in  Section  5.  We  end  this  paper  by  summarizing  the  main  result  in  Section  6. 


2  Voter  Model  with  Temporal  Decay  Dynamics 

We  define  the  TDV  (Temporal  Decay  Voter)  model.  Let  G  =  (V,  E)  be  a  directed  network 
with  self-loops,  where  V  and  E  (c  V  x  V)  are  the  sets  of  all  nodes  and  links  in  the 
network,  respectively.  Here,  (m,  v)  6  E  denotes  a  (directed)  link  from  node  u  to  node 
v.  When  there  is  a  link  ( u ,  v),  we  assume  that  v  can  be  influenced  by  its  neighbor  u  in 
opinion  formation  process.  For  a  node  v  6  V,  let  B(v)  denote  the  set  of  neighbors  of  v 
in  G,  that  is, 

B(v )  =  {«  €  V\  (u,  v)  e  E). 

Note  that  v  6  B(v).  Given  an  integer  K  with  K  >  2,  we  consider  the  spread  of  K  opinions 
(opinion  1,  ■  •  • ,  opinion  K )  on  G,  where  each  node  holds  exactly  one  of  the  K  opinions 
at  any  time  t  (>  0).  We  assume  that  each  node  of  G  initially  holds  one  of  the  K  opinions 
with  equal  probability  at  time  t  —  0.  We  denote  by 


8,  :  V ,K] 


the  opinion  distribution  at  time  t,  where  g,(v)  stands  for  the  opinion  of  node  v  at 
time  t.  Note  that  go  stands  for  the  initial  opinion  distribution.  For  any  v  e  V  and 
k  €  {1, 2,  ■  •  •  ,  K),  let  U tit,  v)  be  the  set  of  v’s  neighbors  that  hold  opinion  k  as  its  latest 
opinion  (before  time  t),  i.e., 

Uk(t,  v)  =  [u  e  B(v)\  ip,(u)  =  k], 


where  < pt(u)  is  the  latest  opinion  of  u  (before  time  t). 
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2.1  Voter  Model 

We  first  recall  the  definition  of  the  voter  model  (see,  e.g.,  [13]),  which  is  one  of  the 
standard  models  of  opinion  dynamics,  where  K  is  usually  set  to  2.  The  evolution  process 
of  the  voter  model  is  defined  as  follows: 

1.  At  time  0,  each  node  v  independently  decides  its  update  time  t  according  to  some 
probability  distribution  such  as  an  exponential  distribution  with  parameter  yv  =  l.1 
The  successive  update  time  is  determined  similarly  at  each  update  time  t. 

2.  At  an  update  time  f,  the  node  v  adopts  the  opinion  of  a  randomly  chosen  neighbor 
u,  i.e., 

gt(v)  =  ip,(u). 

3.  The  process  is  repeated  from  the  initial  time  t  =  0  until  the  next  update-time  passes 
a  given  final-time  T \ . 

We  note  that  in  the  voter  model  each  individual  tends  to  adopt  the  majority  opinion 
among  its  neighbors.2  Here  note  that  the  definition  of  one’s  neighbors  include  oneself 
because  of  the  existence  of  self  loop.  Thus,  we  can  extend  the  original  voter  model  with 
2  opinions  to  a  voter  model  with  K  opinions  by  replacing  Step  2  with:  At  an  update  time 
t,  the  node  v  selects  one  of  the  K  opinions  according  to  the  probability  distribution, 

P(g'(y)  =  k)  =  {k=  h  -  ,K).  (1) 

|£(v)| 


2.2  Temporal  Decay  Voter  Model 

As  mentioned  earlier,  people  may  decide  their  opinions  by  taking  account  not  only  of 
their  neighbors’  latest  opinions,  but  also  of  their  neighbors’  past  opinions  including 
their  own  opinions.  In  order  to  model  this  kind  of  situation,  for  any  t  >  0  and  v  e  V,  we 
consider  the  set  M(f,  v)  consisting  of  the  time  r  (<  t)  at  which  an  individual  (a  node)  v 
manifested  his/her  opinion.  For  k  —  1,  •  •  •  ,K,  we  also  consider  a  subset  of  M(t,  v). 


Mk(t,v )  =  {t  e  M(t,  v);  gr(v)  =  k }, 


where  Mk(t ,  v)  is  the  set  of  node  v’s  opinion  manifestation  time  instances  before  time  t 
in  which  v  takes  opinion  k.  Now,  we  can  define  a  voter  model  which  takes  all  the  past 
opinions  into  consideration.  In  this  model,  Eq.  (1)  is  replaced  with 


P(g,(v)  =  k) 


1  +  2HeB(v)  I Mk(t,  U)\ 

K  +  YjneB(v)  I M(t,  u) | 


(2) 


where  we  employed  a  Bayesian  prior  known  as  the  Laplace  smoothing.  Here  we  note 
that  the  Laplace  smoothing  of  Eq.  (2)  corresponds  to  the  assumption  that  each  node  ini¬ 
tially  holds  one  of  the  K  opinions  with  equal  probability  at  time  t  -  0.  Note  also  that  the 

1  This  assumes  that  the  average  delay  time  is  1. 

2  In  reality  there  may  be  a  case  that  one  changes  its  opinion  to  a  medium  one  (say  3)  listening 
to  two  opposite  opininons  (say  1  and  5).  The  voter  model  does  not  consider  this  possibility 
unless  at  least  one  of  the  neighbors  has  already  the  medium  opinion  (3). 
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Laplace  smoothing  corresponds  to  a  special  case  of  Dirichlet  distributions  that  are  very 
often  used  as  prior  distributions  in  Bayesian  statistics,  and  in  fact  the  Dirichlet  distribu¬ 
tion  is  the  conjugate  prior  of  the  categorical  distribution  and  multinomial  distribution. 
We  refer  to  this  voter  model  as  the  base  TDV  model. 

Thus  far,  we  assumed  that  all  the  past  opinions  are  equally  weighted.  However,  it  is 
naturally  conceivable  that  the  quite  old  opinions  have  almost  no  influence.  Older  opin¬ 
ions  are  less  influential  in  general.  In  order  to  reflect  this  kind  of  effects  into  the  model, 
we  consider  introducing  some  decay  functions.  The  simplest  one  is  an  exponential  de¬ 
cay  function  defined  by 

p(At\A)  =  exp  (-AAt),  (3) 

where  A  >  0  is  a  parameter  and  At  =  t  -  t  stands  for  the  time  difference  between 
the  opinion  adoption  time  t  and  the  opinion  manifestation  time  t.  Another  natural  one 
would  be  a  power-law  decay  function  defined  by 

p(At;A)  =  (At)~/l  =  exp(-Tlogzlf),  (4) 

where  A  >  0  is  a  parameter. 

Now,  we  construct  a  more  general  decay  function.  For  a  given  positive  integer  /, 
let  fi(At),  •  •  ■  ,fj(At )  be  functions  on  (0,  +oo)  such  that  1,  f\  (At),  ■■  ■ ,  fj(At)  are  linearly 
independent,  that  is,  if  To,  A i ,  •  •  ■  ,Aj  are  real  numbers  and  satisfy 

j 

To  +  J]  Ajfj(At)  =  0,  (VAt  e  (0,  +«,)), 

7=1 

then  To  =  T|  =  •  •  •  =  Aj  =  0.  We  then  consider  a  ./-dimensional  feature  vector, 

Fj(At)  =  (J](At),---  ,fj(At))T, 

where  aT  denote  the  transpose  of  column  vector  a.  For  a  /-dimensional  real  column 
vector  with  non-negative  elements, 

Aj  =  (Aw-  ,  Aj)T , 


which  is  a  parameter  vector,  we  define  a  decay  function  p(At\  Ty)  by 

p(At;Aj)  =  exp(-A/Fj(At)),  (5) 

where  the  matrix  operations  are  used.  Representative  candidates  of  feature  vector  F j(At) 
include 

F,(At)  =  At,  F\(At)  =  log  At,  F\{At)  =  {At)2 

for  /  =  1, 

F2(At)  =  (At,  logAtf,  F2(At)  =  [At,  (At)2)',  F2(At)  =  (log At,  (At)2)7 
for  /  =  2, 

F2(At)  =  [At,  log  At,  (At)2)7 
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for  J  -  3,  etc.  Note  that  p(At\  Aj)  becomes  the  exponential  decay  function  if  /  =  1  and 
F j(At)  =  At,  and  the  power-law  decay  function  if  /  =  1  and  Fj(At )  =  log /It. 

Using  our  general  decay  function  p(At;  Aj)  (see  Eq.  (5)),  we  define  the  TDV  (Tem¬ 
poral  Decay  Voter)  model  in  the  following  way.  In  this  model,  Eq.  (1)  is  replaced  with 


P(g,(v)  =  k) 


i  "h  Z  ucB{  v)  2jT£M/;(t,u)  P(t  T  Aj) 

F  +  ZhcB(v)  ZrC/V7(M/)  fAl  ~  r,  Aj) 


(k  =!,•••  ,K). 


(6) 


Here  note  that  Eq.  (6)  is  reduced  to  Eq.  (2)  when  Aj  is  the  /-dimensional  zero-vector 
0/,  that  is,  the  TDV  model  of  Aj  =  0/  coincides  with  the  base  TDV  model. 


3  Learning  Method 

We  consider  the  problem  of  identifying  the  TDV  model  on  network  G  from  an  observed 
data  DTo  in  time-span  [0,  To],  where  Oj0  consists  of  a  sequence  of  ( k ,  t,  v)  such  that  node 
v'  changed  its  opinion  to  opinion  k  at  time  t  for  0  <  t  <  Tq.  The  identified  model  can  be 
used  to  predict  how  much  of  the  share  each  opinion  will  have  at  a  future  time  T \  (>  Tq), 
and  to  identify  both  high  decay  tendency  data  sets  and  low  decay  tendency  data  sets. 


3.1  Parameter  Estimation 


We  describe  a  method  for  estimating  decay  parameter  values  of  the  TDV  model  from 
a  given  observed  opinion  spreading  data  Dj:.  Based  on  the  evolution  process  of  our 
model  (see  Eq.  (6)),  we  can  obtain  the  likelihood  function, 


-C(Dt0',Aj)  -  log 


PI  P(g,(v)  =  k) 

Sk,l,v)eDTo 


(7) 


where  A  j  stands  for  the  /-dimensional  vector  of  decay  parameter  values,  as  explained  in 
the  previous  subsection.  Thus  our  estimation  problem  is  formulated  as  a  maximization 
problem  of  the  objective  function  with  respect  to  //. 

We  derive  an  iterative  algorithm  for  obtaining  the  maximum  likelihood  estimators. 
From  the  definitions  of  P(gt(v)  =  k)  (see  Eq.  (6))  and  p(At;Aj)  (see  Eq.  (5)),  we  can 
express  Eq.  (7)  as  follows: 


-C(DTo-,Aj)=  log  1  +  X  exP  (~^JT FAt  -  t)) 

(k,t,v)£DT0  ueB(v)reMk(t,u) 

~  J]  log  K+  X  Yj  exP  (-AjTFj(t  -  r)) 

( k,t,v)eDr0  ueB(v )  reM(t,u ) 


(8) 


Now,  let  A  j  be  the  current  estimate  of  Aj.  We  foucus  on  the  first  term  of  the  right-hand 
side  of  Eq.  (8),  and  define  qk,t,v(j\  Aj)  by 

exp  (-AjTFj{t  -  t)) 

1  +  ZU€B(v)  Z T'eMt(t,u )  exP  \-AjTFj(t  -  t0) 
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for  any  k  e  { 1 ,  •  •  •  ,/if},  t  e  (0,  T],  v  e  V ,  and  r  e  UueBoo  Mt(t,  u).  Note  that  for  any 
(k,  t,v)  e  DTo, 


quA t;  Aj)  >  0, 

z  z  Aj)  + 


\ 

Vt  e  MAt,  u) 

,  ueB(v)  , 


ueB(v)  T€My(t,u)  ^  2mg6(v)  (  Aj  Fj(t  T  )  j 

We  can  transform  our  objective  function  as  follows: 

£(DTo-Aj)  =  Q(Aj;Aj)-'H(Aj;Aj), 
where  <3  (d y;  Aj)  is  defined  by 

(2  (Ay;  Ay)  =  -  XXX  qk.t.v(T’Aj)  AjT Fj(t  -  r) 

(k,t,v)e£)T0  ueB(v )  reM/c(t,u) 

-  X  1°§  K  +  X  X  exP  (-AjTF, j(t  -  r)) 

(k,t,v)eDT0  ueB(v)  reM(t,u) 

and  W  (Aj;  Aj)  is  defined  by 

■h(m)=  z  z  z  Vk.t.v  (t\  Aj)  log  qkAv(r;  Ay) 

( k,t,v)e£>T0  yueB(v)  TeMk(t,u) 


(9) 

=  I-  (10) 

(11) 


l  +  ZH£«(v)  Yjt’emmm)  exP  -h  -  r 


[-AjTFj{t  -  r')) 


(12) 


x  log 


,  1  +  Z«€fl(v)  Zr'E Mk(t,u)  exP  (-AyTFy(t  -  T')) 


■•(13) 


By  Eqs.  (9),  (10),  (13),  and  the  property  of  the  KL-divergence,  it  turns  out  that  77  (Ay;  Ay) 
is  maximized  at  Ay  -  Ay.  Hence,  we  can  increase  the  value  of  X.  (Dy0;Aj)  by  maximiz¬ 
ing  <3  (Aj;  Aj)  with  respect  to  Ay  (see  Eq.  (11)). 

We  derive  an  update  formula  for  maximizing  Q(Ay;  Ay).  We  foucus  on  the  second 
term  of  the  right-hand  side  of  Eq.  (12)  (see  also  the  second  term  of  the  right-hand  side 
of  Eq.  (8)),  and  define  rM.(r;  Ay)  by 


D,v(t;  Aj) 


exp  (-AjT  F y(t  -  t)) 

K  +  Z«€fl(v)  Xt'emu.u)  exp  (-AyTFy{t  -  r')) 


(14) 


for  any  t  e  (0,  T],  v  e  V,  and  r  e  UueB(r)  M{t,  u).  Note  that  for  any  ( k ,  /,  v)  e  Dy0, 


D,v(r;  Aj)  >  0, 


/  ' 
Vt  e  |^J  M(t,  u) 

,  ueB(v)  v 


X  X  rtAT-Aj)<1- 

ueB(v )  reM(t,u) 


(15) 
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From  Eqs.  (12)  and  (14),  we  can  easily  see  that  the  gradient  vector  of  Q(Aj;Aj)  with 
respect  to  Aj  is  given  by 


dQ(Aj\Aj ) 


dA, 


=  -  Z  Z  Z  q,’v’k  (T:  Aj) Fj(t  ~  T) 

(t,v,k)eDT0  ueB(v )  \reMk(t,u) 

-  Z  rt,y{T\  Aj)F ](t  -  T) 
TeM(t,u ) 


(16) 


Moreover,  from  Eqs.  (14)  and  (16),  we  can  obtain  the  Hessian  matrix  of  Q(Aj;  A/j 
follows: 


dAjdA / 


-  Z  |  Z  Z  Aj)  Fj(t  -  t)  Fj(t  -  t)7 

(k,t,v)eDr0  yueBiv)  reM(t,u) 


Z  Z  r,’v^T;  Aj^ Fj^  ~  T-) 

?(v)  reM(t,u) 

Z  ^  4/)  Fy(f  -  t) 

\ueB(v)  reM(t.u) 

By  Eq.  (17),  for  any  /-dimensional  real  column  vector  we  have 
r32e(4y;4y) 

JC,  - —  X, 

dAjdA / 

=  -  2  (z  2  ^,»(t;^)(x/^-t))2 

(k,t,v)eDr0  T£M(t,u ) 

-  ^  ^  r„v(T;Aj)XjT  Fj(t-T) 

KueB(v)  reM(t,u ) 


(17) 


-  Z  Z  Z 

( k,t,v)eDr0  \  ueB(v )  reM(t,u) 


Xj  Fj(t-T) 


ueB(v )  TeM(t,u ) 

Thus,  by  Eq.  (15),  we  obtain 

t&Q(Aj;Aj) 


-  Z  Z  rM’(T,;  Aj)x/  F j(t  -  r') 

ueB(v)  T'eM(t,u ) 

\ 

l-Z  Z  r'.’'(T;'*j)  Z  Z  r‘^T’Aj)XjT  Fj(t-T) 


1  dAjdA/ 


/  Vwefi(v)  reM(t,u) 


Xj  <  0,  (v*y  e  Ry) , 
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that  is,  the  Hessian  matrix  is  negative  semi-definite.  Hence,  by  solving  the  equation 


dAj 


=  07 


(see  Eq.  (16)),  we  can  find  the  value  of  Aj  that  maximizes  Q  (/l/;  T/j.  We  employed  a 
standard  Newton  Method  in  our  experiments. 


3.2  Model  Selection 

One  of  the  important  purposes  of  introducing  the  TDV  model  is  to  analyze  how  people 
are  affected  by  their  neighbors’  past  opinions  for  a  specific  opinion  formation  process. 
In  what  follows,  for  a  given  set  of  candidate  decay  functions  (i.e.,  feature  vectors),  we 
consider  selecting  one  being  the  most  appropriate  to  the  observed  data  l)Tll  of  \  '-Dr„\  = 
N,  where  N  represents  the  number  of  opinion  manifestations  by  individuals. 

As  mentioned  in  Section  2,  the  base  TDV  model  is  a  special  TDV  model  equipped 
with  the  decay  function  that  equally  weights  all  the  past  opinions.  Thus,  we  first  ex¬ 
amine  whether  or  not  the  TDV  model  equipped  with  a  candidate  decay  function  can 
be  more  appropriate  to  the  observed  data  Dja  than  the  base  TDV  model.3  To  this  end, 
we  employ  the  likelihood  ratio  test.  For  a  given  feature  vector  Fj (At),  let  Aj(F j)  be  the 
maximal  likelihood  estimator  of  the  TDV  model  equipped  with  the  decay  function  of 
Fj(At).  Since  the  base  TDV  model  is  the  TDV  model  of  Aj  -  0 j,  the  log-likelihood 
ratio  statistic  of  the  TDV  model  with  F ./(At)  against  the  base  TDV  model  is  given  by 

Yn(Fj)  =  £(dTo-,Aj(Fj))-£(DTo;0j).  (18) 

It  is  well  known  that  2 Y^(F j)  asymptotically  approaches  to  the  y1  distribution  with 
J  degrees  of  freedom  as  N  increases.  We  set  a  significance  level  a  (0  <  a  <  1),  say 
a  =  0.005,  and  evaluate  whether  or  not  the  TDV  model  with  F j(At)  fits  significantly 
better  than  the  base  TDV  model  by  comparing  2 Y^(F j)  to  yJr.  Here,  yJa  denotes  the 
upper  a  point  of  the  y2  distribution  of  J  degrees  of  freedom,  that  is,  it  is  the  positive 
number  z  such  that 


Tvkv*  i d>' =  l~a- 

where  r(s)  is  the  gamma  function.  We  consider  the  set  T'V  of  the  candidate  feature 
vectors  (i.e.,  decay  functions)  selected  by  this  likelihood  ratio  test  at  significance  level 
a.  Next,  we  find  the  feature  vector  F*r,(At)  e  T'V  such  that  it  maximizes  the  log- 
likelihood  ratio  statistic  Y^(Fj),  ( Fj(At )  6  T'V),  (see  Eq.  (18)),  and  propose  selecting 
the  TDV  model  equipped  with  the  decay  function  of  F*r{At).  If  the  set  T'V  is  empty, 
we  select  the  base  TDV  model  for  Dt0- 


3  The  base  TDV  model  is  not  the  only  baseline  model  with  which  the  proposed  method  is  to  be 
compared.  The  simplest  one  would  be  the  random  opininon  model  in  which  each  user  chooses 
its  opinionn  randomly  independent  of  its  neighbors. 
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Fig.  1.  Results  of  model  selection  validity  for  the  exponential  TDV  model. 


Here  we  recall  that  typical  decay  functions  in  natural  and  social  sciences  include 
the  exponential  decay  function  (see  Eq.  (3))  and  the  power-law  decay  functions  (see 
Eq.  (4)).  We  refer  to  the  TDV  models  of  the  exponential  and  the  power-law  decay  func¬ 
tions  as  the  exponential  TDV  model  and  the  power-law  TDV  model,  respectively.  In  our 
experiments,  we  in  particular  focus  on  investigating  which  of  the  base,  the  exponen¬ 
tial,  and  the  power-law  TDV  models  best  fits  to  the  observed  data  Dt0-  Thus,  the  TDV 
model  to  be  considered  has  /  =  1  and  parameter  A. 


4  Evaluation  by  Synthetic  Data 

Using  synthetic  data,  we  examined  the  effectiveness  of  the  proposed  method  for  pa¬ 
rameter  estimation  and  model  selection.  We  assumed  complete  networks  for  simplicity. 
According  to  the  TDV  model,  we  artificially  generated  an  opinion  diffusion  sequence 
D/n  consisting  of  3-tuple  (k,t,v)  of  opinion  k,  time  t  and  node  v  such  that  | T>ra\  =  N, 
and  applied  the  proposed  method  to  the  observed  data  Dt0,  where  the  significance  level 
a  =  0.005  was  used  for  model  selection.  As  mentioned  in  the  previous  section,  we  as¬ 
sumed  two  cases  where  the  true  decay  follows  the  exponential  distribution  (see  Eq.  (3)) 
and  the  power-law  distribution  (see  Eq.  (3)),  respectively.  Let  YeN  and  denote  the 
log-likelihood  ratio  statistics  of  the  exponential  and  the  power-law  TDV  models  against 
the  base  TDV  model,  respectively  (see  Eq.  (18)).  We  varied  the  value  of  parameter 
A  in  the  following  range:  A  =  0.01,0.03,0.05  for  the  exponential  TDV  model,  and 
A  =  0.4, 0.5, 0.6  for  the  power-law  TDV  model,  on  the  basis  of  the  analysis  performed 
for  the  real  world  @cosme  dataset  (see.  Section  5).  We  conducted  100  trials  varying  the 
observed  data  Dj0  of  \Or„\  -  N,  and  evaluated  the  proposed  method. 

First,  we  investigated  the  model  selection  validity  'Fn / 1 00,  where  'Fn  is  the  number 
of  trials  in  which  the  true  model  was  successfully  selected  by  the  proposed  method. 
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number  of  samples 


Fig.  2.  Results  of  Parameter  estimation  error  for  the  exponential  TDV  model. 


Namely,  if  the  exponential  TDV  model  is  the  true  model,  then  fTyv  is  defined  by  the 
number  of  trials  such  that 

2 YeN  >  max(*l  a,  2 YPN), 

and  if  the  power-law  TDV  model  is  the  true  model,  then  Tn  is  defined  by  the  number 
of  trials  such  that 

2 Ypn  >  max(*l  a,  2 YeN). 

Second,  we  examined  the  parameter  estimation  error  S.y  for  the  trials  in  which  the  true 
model  was  selected  by  the  proposed  method.  Here,  S.y  is  defined  by 

e  |d(A0-T*| 

&N~  T*  ’ 

where  A*  is  the  true  value  of  parameter  A,  and  A(N)  is  the  value  estimated  by  the  pro¬ 
posed  method  from  the  observed  data  rDjn  of  =  N.  Figures  1  and  2  show  the 
results  for  the  exponential  TDV  model,  and  Figures  3  and  4  show  the  results  for  the 
power-law  TDV  model.  Here,  Figures  1  and  3  display  model  selection  validity  'FJv/100 
as  a  function  of  sample  size  N .  Figures  2  and  4  display  parameter  estimation  error  £y 
as  a  function  of  sample  size  N.  As  expected,  Tn  increases  and  Sy  decreases  as  N  in¬ 
creases.  Moreover,  as  A  becomes  larger,  Tn  increases  and  <5V  decreases.  Note  that  a 
large  A  means  quickly  forgetting  past  activities,  and  a  small  A  means  slowly  forgetting 
them.  Thus,  we  can  consider  that  a  TDV  model  of  smaller  A  requires  more  samples 
to  correctly  learn  the  model.  From  Figures  1,  2,  3  and  4,  we  observe  that  the  proposed 
method  can  work  almost  perfectly  when  N  is  greater  than  500,  and  A  is  greater  than  0.01 
for  the  exponential  TDV  model  and  greater  than  0.4  for  the  power-law  TDV  model. 
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number  of  samples 


Fig.  3.  Results  of  model  selection  validity  for  the  power-law  TDV  model. 


5  Findings  in  Opinion  Formation  on  Social  Media 

5.1  Dataset 

We  collected  real  data  from  “@cosme”  4,  which  is  a  Japanese  word-of-mouth  commu¬ 
nication  website  for  cosmetics.  In  @cosme,  a  user  can  post  a  review  and  give  a  score  of 
each  brand  (one  from  1  to  7).  When  one  user  registers  another  user  as  his/her  favorite 
user,  a  “fan-link”  is  created  between  them.  We  traced  up  to  ten  steps  in  the  fan-links 
from  a  randomly  chosen  user  in  December  2009,  and  collected  a  set  of  ( b ,  k,  t,  v)’s, 
where  ( b ,  k,  t,  v)  means  that  user  v  scored  brand  b  k  points  at  time  t.  The  number  of 
brands  was  7,139,  the  number  of  users  was  45,024,  and  the  number  of  reviews  posted 
was  331,084.  For  each  brand  b,  we  regarded  the  point  k  scored  by  a  user  v;  as  the 
opinion  k  of  v,  and  constructed  the  opinion  diffusion  sequence  D/Jb)  consisting  of 
3-tuple  ( k ,  t,  v).  In  particular,  we  focused  on  these  brands  in  which  the  number  of  sam¬ 
ples  N  =  \'DrJb)\  was  greater  than  500.  Then,  the  number  of  brands  was  120.  We  refer 
to  this  dataset  as  the  @cosme  dataset. 


5.2  Results 

We  applied  the  proposed  method  to  the  @cosme  dataset.  Again,  we  adopted  the  tempo¬ 
ral  decay  voter  models  with  the  exponential  and  the  power-law  distributions,  and  used 
the  significance  level  a  =  0.005  for  model  selection.  There  were  9  brands  such  that 
2  Y‘n  >  xx  n.i  and  93  brands  such  that  2 >Xla-  Here,  in  the  same  way  as  the  previous 
section,  YeN  and  denote  the  log-likelihood  ratio  statistics  of  the  exponential  and  the 
power-law  TDV  models  against  the  base  TDV  model,  respectively.  Further,  there  were 

4  http://www.cosme.net/ 
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Fig.  4.  Results  of  Parameter  estimation  error  for  the  power-law  TDV  model. 


92  brands  such  that  2 >  max  [x{  a,  2 Y^j,  one  brand  such  that  2 YeN  >  max  (x^  a,  2T^, 

and  27  brands  such  that  max  (2 Y^,  2K'vj  <  X\  „■  Namely,  according  to  the  proposed 
method,  92  brands  were  the  power-law  TDV  model,  27  brands  were  the  base  TDV 
model,  and  only  one  brand  was  the  exponential  TDV  model.  These  results  show  that 
most  brands  conform  to  the  power-law  TDV  model.  This  also  agrees  with  the  work  [1, 
19]  that  many  human  actions  are  related  to  power-laws. 

Figures  5  and  6  show  the  results  for  the  @cosme  dataset  from  the  point  of  view 
of  the  power-law  TDV  model.  Figure  5  plots  the  log-likelihood  ratio  statistic  Y^  for 
each  brand  as  a  function  of  sample  size  N,  where  the  thick  solid  line  indicates  the  value 
of  Xi„-  In  addition  to  the  brands  plotted,  there  is  a  brand  such  that  Y^  —  Y‘  =  0. 
It  was  brand  “YOJIYA”,  which  is  a  traditional  Kyoto  brand,  and  is  known  as  a  brand 
releasing  new  products  less  frequently.  Thus,  we  speculate  that  it  conforms  to  the  base 
TDV  model.  Figure  6  plots  the  pair  ( Y^,  TW))  for  the  brands  in  which  the  power-law 
TDV  model  was  selected  by  the  proposed  method,  where  A(N)  is  the  value  of  parameter 
A  estimated  by  the  proposed  method  from  the  observed  data  Dr0(b)  of  \Dj0{b)\  =  N. 
From  Figure  6,  we  observe  that  K]’.  and  A(N)  are  positively  correlated.  This  agrees 
with  the  fact  that  the  power-law  TDV  model  with  A  =  0  corresponds  to  the  base  TDV 
model.  In  Figures  5  and  6,  the  big  solid  red  circle  indicates  the  brand  “LUSH-JAPAN”, 
which  had  the  largest  values  of  Y^,,  A(N )  and  N,  respectively.  We  also  find  the  big 
solid  green  triangle  in  Figure  5  as  a  brand  that  had  a  large  value  of  Y ^  and  a  relatively 
small  value  of  N.  This  was  the  brand  “SHISEIDO  ELIXIR  SUPERIEUR”,  which  had 
the  seventh  largest  value  of  Y^,  N  -  584,  and  /1(A)  =  0.58.  Note  that  these  brands 
“LUSH-JAPAN”  and  “SHISEIDO  ELIXIR  SUPERIEUR”  are  known  as  brands  that 
were  recently  established  and  release  new  products  frequently.  Thus,  we  speculate  that 
they  conform  to  the  power-law  TDV  model  with  large  A. 
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Fig.  5.  Log-likelihood  ratio  statistic  Y £  and  number  of  samples  N  for  the  @cosme  dataset. 


6  Conclusion 

We  addressed  the  problem  of  how  people  make  their  own  decisions  based  on  their 
neighbors’  opinions.  The  model  best  suited  to  discuss  this  problem  is  the  voter  model 
and  several  variants  of  this  model  have  been  proposed  and  used  extensively.  However, 
all  of  these  models  assume  that  people  use  their  neighbors’  latest  opinions.  People 
change  opinions  over  time  and  some  opinions  are  more  persistent  and  some  others 
are  less  persistent.  These  depend  on  many  factors  but  the  existing  models  do  not  take 
this  effect  into  consideration.  In  this  paper,  we,  in  particular,  addressed  the  problem  of 
how  people’s  opinions  are  affected  by  their  own  and  other  peoples’  opinion  histories.  It 
would  be  reasonable  to  assume  that  older  opinions  are  less  influential  and  recent  ones 
are  more  influential.  Based  on  this  assumption,  we  devised  a  new  voter  model,  called  the 
temporal  decay  voter  (TDV)  model  which  uses  all  the  past  opinions  in  decision  making 
in  which  decay  is  assumed  to  be  a  linear  combination  of  representative  decay  functions 
each  with  different  decay  factors.  The  representative  functions  include  the  linear  decay, 
the  exponential  decay,  the  power-law  decay  and  many  more.  Each  of  them  specifies  only 
the  form  and  the  parameters  remain  unspecified.  We  formulated  this  as  a  machine  learn¬ 
ing  problem  and  solved  the  following  two  problems:  1)  Given  the  observed  sequence 
of  people’s  opinion  manifestation  and  an  assumed  decay  function,  learn  the  parameter 
values  of  the  function  such  that  the  corresponding  TDV  model  best  explains  the  obser¬ 
vation,  and  2)  Given  a  set  of  decay  functions  each  with  the  optimal  parameter  values, 
choose  the  best  model  and  refute  others.  We  solved  the  former  problem  by  maximiz¬ 
ing  the  likelihood  and  derived  an  efficient  parameter  updating  algorithm,  and  the  latter 
problem  by  choosing  the  decay  model  that  maximizes  the  log  likelihood  ratio  statistic. 
We  first  tested  the  proposed  algorithms  by  synthetic  datasets  assuming  that  there  are 
two  decay  models:  the  exponential  decay  and  the  power-law  decay.  We  confirmed  that 
the  learning  algorithm  correctly  identifies  the  parameter  values  and  the  model  selection 
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Fig.  6.  Log-likelihood  ratio  statistic  Y %  and  estimated  parameter  value  A(N)  for  the  @cosme 
dataset. 

algorithm  correctly  identifies  which  model  the  data  came  from.  We  then  applied  the 
method  to  the  real  opinion  diffusion  data  taken  from  a  Japanese  word-of-mouth  com¬ 
munication  site  for  cosmetics.  We  used  the  two  decay  functions  above  and  added  no 
decay  function  as  a  baseline.  The  result  of  the  analysis  revealed  that  opinions  of  most 
of  the  brands  conform  to  the  TDV  model  of  the  power-law  decay  function.  We  found 
this  interesting  because  this  is  consistent  with  the  observation  that  many  human  ac¬ 
tions  are  related  to  the  power-law.  Some  brands  showed  behaviors  characteristic  to  the 
brands,  e.g.,  the  older  brand  that  releases  new  product  less  frequently  naturally  follows 
no  decay  TDV  and  the  newer  brand  that  releases  new  product  more  frequently  natu¬ 
rally  follows  the  power-law  decay  TDV  with  large  decay  constant,  which  are  all  well 
interpretable. 
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Abstract.  We  propose  a  method  of  detecting  the  period  in  which  a  burst  of  in¬ 
formation  diffusion  took  place  from  an  observed  diffusion  sequence  data  over  a 
social  network  and  report  the  results  obtained  by  applying  it  to  the  real  Twitter 
data.  We  assume  a  generic  information  diffusion  model  in  which  time  delay  as¬ 
sociated  with  the  diffusion  follows  the  exponential  distribution  and  the  burst  is 
directly  reflected  to  the  changes  in  the  time  delay  parameter  of  the  distribution 
(inverse  of  the  average  time  delay).  The  shape  of  the  parameter  change  is  ap¬ 
proximated  by  a  series  of  step  functions  and  the  problem  of  detecting  the  change 
points  and  finding  the  values  of  the  parameter  is  formulated  as  an  optimization 
problem  of  maximizing  the  likelihood  of  generating  the  observed  diffusion  se¬ 
quence.  Time  complexity  of  the  search  is  almost  proportional  to  to  the  number 
of  observed  data  points  (possible  change  points)  and  very  efficient.  We  apply  the 
method  to  the  real  Twitter  data  of  the  2011  To-hoku  earthquake  and  tsunami,  and 
show  that  the  proposed  method  is  by  far  efficient  than  a  naive  method  that  adopts 
exhaustive  search,  and  more  accurate  than  a  simple  greedy  method.  Two  inter¬ 
esting  discoveries  are  that  a  burst  period  between  two  change  points  detected  by 
the  proposed  method  tends  to  contain  massive  homogeneous  tweets  on  a  specific 
topic  even  if  the  observed  diffusion  sequence  consists  of  heterogeneous  tweets  on 
various  topics,  and  that  assuming  the  information  diffusion  path  is  a  line  shape 
tree  can  give  a  good  approximation  of  the  maximum  likelihood  estimator  when 
the  actual  diffusion  path  is  not  known. 


1  Introduction 

Recent  technological  innovation  and  popularization  of  high  performance  mobile/smart 
phones  has  changed  our  communication  style  drastically  and  the  use  of  various  social 
media  such  as  Twitter  and  Facebook  has  been  affecting  our  daily  lives  substantially.  In 
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these  social  media,  information  propagates  through  the  social  network  formed  based  on 
friendship  relations.  Especially,  Twitter,  micro-blog  in  which  the  number  of  characters 
is  limited  to  140,  is  now  very  popular  among  the  young  generation  due  to  its  handiness 
and  easiness  of  usage,  and  it  is  fresh  to  our  memory  that  Twitter  played  a  very  important 
role  as  the  information  infrastructure  during  the  recent  natural  disaster,  both  domestic 
and  abroad,  including  the  201 1  To-hoku  earthquake  and  tsunami  in  Japan. 

In  these  social  networks,  there  have  been  proposed  several  measures,  called  central¬ 
ity,  that  characterize  nodes  in  the  network  based  on  the  structure  of  the  network  [11, 
1,3],  While  such  centrality  measures  can  be  used  to  identify  those  nodes  that  play  an 
important  role  in  diffusing  information  over  the  network,  it  has  also  been  shown  that 
measures  based  solely  on  the  network  structure  are  not  good  enough  to  a  such  problem 
of  influence  maximization  [11,1,3]  in  which  the  task  is  to  identify  a  limited  number  of 
nodes  which  together  maximizes  the  information  spread  and  that  explicit  use  of  infor¬ 
mation  diffusion  mechanism  is  essential  [5],  In  general,  the  mechanism  is  represented 
by  a  probabilistic  diffusion  model.  Most  representative  and  basic  ones  are  the  Indepen¬ 
dent  Cascade  (IC)  model  [2, 4]  and  the  Linear  Threshold  (LT)  model  [12, 13]  including 
their  extended  versions  that  explicitly  handle  asynchronous  time  delay.  Asynchronous 
time  delay  Independent  Cascade  (AsIC)  model  [8]  and  Asynchronous  time  delay  Lin¬ 
ear  Threshold  (AsLT)  model  [9],  In  fact,  the  nodes  and  links  that  are  identified  to  be 
influential  using  these  models  are  substantially  different  from  those  identified  by  the 
existing  centrality  measures. 

In  reality,  we  observe  that  the  information  on  a  certain  topic  propagates  explosively 
for  a  very  short  period  of  time.  Because  such  information  affects  our  behaviour  strongly, 
it  is  important  to  understand  the  observed  event  in  a  timely  manner.  This  brings  in  an 
important  and  interesting  problem  which  is  to  accurately  and  efficiently  detect  the  burst 
from  the  observed  information  diffusion  data  and  to  identify  what  caused  this  burst  and 
how  long  it  persisted.  Any  of  the  above  mentioned  probabilistic  models  cannot  handle 
this  kind  of  problem  because  they  assume  that  information  diffuses  in  a  stationary  en¬ 
vironment,  i.e.  model  parameters  are  stationary.  Zhu  and  Shasha  [14]  approached  this 
problem  without  relying  on  a  diffusion  model.  They  detected  a  burst  period  for  a  target 
event  by  counting  the  number  of  its  occurrences  in  a  given  time  window  and  check¬ 
ing  whether  it  exceeds  a  predetermined  threshold  or  not.  Kleinberg  [6]  challenged  this 
problem  using  a  hidden  Markov  model  in  which  bursts  appear  naturally  as  state  tran¬ 
sitions,  and  successfully  identified  the  hierarchical  structure  of  e-mail  messages.  Sun 
et  al.  [10]  extended  Kleinberg’s  method  so  as  to  detect  correlated  burst  patterns  from 
multiple  data  streams  that  co-evolve  over  time. 

We  handle  this  problem  by  assuming  that  parameters  in  the  diffusion  model  changed 
due  to  unknown  external  environmental  factors  and  devise  an  efficient  algorithm  that 
accurately  detects  the  changes  in  the  parameter  values  from  a  single  observed  diffusion 
data  sequence.  In  particular  we  note  that  the  parameter  related  to  the  time  delay  is 
most  crucial  in  the  burst  detection  and  focus  on  detecting  the  changes  in  the  time  delay 
parameter  that  defines  the  delay  distribution.  We  modeled  the  time  delay  in  AsIC  and 
AsLT  models  by  the  exponential  distribution,  thus  we  do  the  same  in  this  paper.  This 
corresponds  to  associating  the  burst  with  the  information  diffusion  with  a  shorter  time 
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delay.  By  focusing  only  on  this  time  delay,  we  can  devise  a  generic  algorithm  that  does 
not  depend  on  a  specific  information  diffusion  model,  e.g.  be  it  either  AsIC  or  AsLT. 

More  precisely,  we  assume  that  time  delay  parameter  changes  are  approximated  by 
a  series  of  step  functions  and  propose  an  optimization  algorithm  that  maximizes  the 
likelihood  ratio  that  is  the  ratio  of  the  likelihood  of  observing  the  data  assuming  the 
time  delay  parameter  changes  (change  points  and  parameter  values  between  the  suc¬ 
cessive  change  points)  to  the  likelihood  of  observing  the  data  assuming  that  there  is  no 
changes  in  the  time  delay  parameter.  The  algorithm  is  based  on  iterative  search  based  on 
recursive  splitting  with  delayed  backtracking,  and  requires  no  predetermined  threshold. 
The  time  complexity  is  almost  proportional  to  the  number  of  observed  data  points  (can¬ 
didates  of  possible  change  points).  We  apply  the  method  to  the  Twitter  data  observed 
during  the  2011  To-hoku  earthquake  and  tsunami  and  confirm  that  the  proposed  method 
can  efficiently  and  accurately  detect  the  change  points.  We  further  analyze  the  content 
of  the  tweets  and  report  the  discovery  that  even  use  of  the  diffusion  sequence  data  of  the 
same  user  ID  (not  necessarily  the  data  on  a  specific  topic)  allows  us  to  identify  that  a 
specific  topic  is  talked  intensively  around  the  beginning  of  the  period  where  the  burst  is 
detected,  and  the  assumption  we  made  that  the  information  diffusion  path  is  a  line  shape 
tree  gives  a  good  approximation  of  the  maximum  likelihood  estimator  in  this  problem 
setting.  Finally,  we  discuss  that  although  the  detected  change  points  do  not  correspond 
exactly  to  nodes  in  a  social  network  that  caused  the  burst  period,  the  detected  change 
points  are  useful  to  find  such  nodes  because  we  can  limit  nodes  to  be  considered  by 
focusing  on  those  around  them. 

The  paper  is  organized  as  follows.  Section  2  briefly  describes  the  framework  of  in¬ 
formation  diffusion  model  on  which  our  problem  setting  is  based.  Section  3  elucidates 
the  problem  setting,  and  Section  4  describes  the  change  point  detection  method  includ¬ 
ing  two  other  methods  that  are  used  for  comparison.  Section  5  reports  experimental 
results  using  real  Twitter  data.  Section  6  summarizes  what  has  been  achieved  in  this 
work  and  addresses  the  future  work. 


2  Information  Diffusion  Model  Framework 

We  consider  information  diffusion  over  a  social  network  whose  structure  is  defined  as 
a  directed  graph  G  -  (V,E),  where  V  and  E  ( c  V  x  V)  represent  a  set  of  all  nodes 
and  a  set  of  all  links,  respectively.  Suppose  that  we  observe  a  sequence  of  information 
diffusion  C  =  {(vo,  to),  (vi,h),  ■  ■  ■  ,  (vj vJn)}  that  arose  from  the  information  released 
at  the  source  node  Vo  at  time  to-  Here,  v„  is  a  node  where  the  information  has  been 
propagated  and  t„  is  its  time.  We  assume  that  the  time  points  are  ordered  such  that 
4-i  <  4  for  any  n  e  {1,  •  •  •  N ).  We  further  assume,  as  a  standard  setting,  that  the  actual 
information  diffusion  paths  of  a  sequence  C  correspond  to  a  tree  that  is  embedded  in  the 
directed  graph  G  representing  the  social  network[7],  i.e.,  the  parent  node  which  passed 
the  information  to  a  node  v„  is  uniquely  identified  to  be  iy„,.  Here,  p{n)  is  a  function 
that  returns  the  node  identification  number  of  the  parent  of  the  node  v„  in  the  range  of 
{0,  •  •  •  ,n-\}. 

The  information  diffusion  model  we  consider  here  is  any  model  that  explicitly  in¬ 
corporates  the  concept  of  asynchronous  time  delay  such  as  AsIC  model  [8]  and  AsLT 
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model  [9]  in  contrast  to  the  traditional  IC  model  [2, 4]  and  LT  model  [12, 13]  that  do  not 
consider  the  time  delay.  Said  differently,  it  is  a  model  that  allows  any  real  value  for  the 
time  tn  at  which  the  information  has  been  propagated  to  a  node  v„  and  assumes  a  certain 
probability  distribution  for  the  time  delay  t„  -  tp(nj .  In  this  paper,  we  use  the  exponential 
distribution  for  the  time  delay,  but  any  other  distribution  such  as  power  law  is  feasible 
exactly  in  the  same  way. 


3  Problem  Settings 

In  this  section  we  formally  define  the  change  point  detection  problem.  As  mentioned 
in  Section  1,  we  assume  that  some  unknown  change  took  place  in  the  course  of  infor¬ 
mation  diffusion  and  what  we  observe  is  a  sequence  of  information  diffusion  of  some 
topic  in  which  the  change  is  encapsulated.  Thus,  our  goal  is  to  detect  each  change  point 
and  how  long  the  change  persisted  from  there.  Note  that  we  basically  pay  attention  to  a 
diffusion  sequence  of  a  certain  topic.  From  our  previous  result  that  people’s  behaviors 
are  quite  similar  when  talking  the  same  topic  [8,9],  we  can  assume  that  the  time  de¬ 
lay  parameter  ru  v  which  is  in  principle  defined  for  each  link  ( u ,  v)  e  E  takes  a  uniform 
value  regardless  of  the  link  it  passes  through.  In  other  word,  we  set  ru  v  =  r  (V(u,  v)  e  E ) 
and  thus,  the  time  delay  of  information  diffusion  is  represented  by  the  following  simple 
exponential  distribution  p(tn  -  tp(ny,  r )  =  rexp(-r(tn  -  tp(n))'). 

With  this  preparation,  we  mathematically  define  the  change  point  detection  prob¬ 
lem.  Let’s  assume  that  we  observe  a  set  of  time  points  of  information  diffusion  sequence 
T)  =  [fo,  ,  fjv).  Let  the  time  of  the  y-th  change  point  be  Tj  (to  <  Tj  <  t ti).  The  de¬ 
lay  parameter  that  the  distribution  follows  switches  from  r,  to  r]+\  at  the  y-th  change 
point  Tj.  Namely,  we  are  assuming  a  series  of  step  functions  as  a  shape  of  parame¬ 
ter  changes.  Let  the  set  comprising  J  change  points  he  Sj  =  {T\,  •  •  •  ,  Tj],  and  we  set 
To  =  to  and  T J+\  —  tN  for  the  sake  of  convenience  (T;  |  <  T j).  Let  the  division  of  D  by 
Sj  by  Dj  =  7y_i  <  tn  <  Tj },  i.e.,  D  =  [to]  U  D\  U  •  •  •  U  Dj+i ,  and  |D;|  represents 
the  number  of  observed  points  in  (7y_i,7y].  Here,  we  request  that  | Dj\  4-  0  for  any 
j  €  [1,  •  •  •  ,/+l]  and  there  exists  at  least  one  t„  and  tn  €  £),  is  satisfied. 

The  log-likelihood  for  the  D,  given  a  set  of  change  points  Sj,  is  calculated,  by 
defining  the  parameter  vector  ry+i  =  (r ] ,  •  •  •  ,  rJ+l),  as  follows. 


7+1 

L(D-rJ+1,Sj)  =  log  P|  P|  r,  exp(-r,(f„  -  tp(n))) 

7=1 

j+ 1  y+i 

=  Yj  ^ log  rJ  ~YrjY (f" _  tp<n>)-  (1) 

7=1  7=1  t„eD  j 

Thus,  the  maximum  likelihood  estimate  of  the  parameter  of  Equation  (1)  is  given  by 


rj 


Yf\  i  L 

'  J'  t  c<T\  ■ 


(2) 
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Further,  substituting  Equation  (2)  to  Equation  (1)  leads  to 


L(D-rJ+uSj)  =  -N  ■ 


7+1 

7=1 


1 

l®7l 


t„eDj 


\rr\  I  /  Stn  tP(n^ 


(3) 


Therefore,  the  change  point  detection  problem  is  reduced  to  the  problem  of  finding  the 
change  point  set  Sj  that  maximizes  Equation  (3).  However,  Equation  (3)  alone  does 
not  allow  us  to  directly  evaluate  the  effect  of  introducing  S  r  We,  thus,  reformulate  the 
problem  as  the  maximization  problem  of  log-likelihood  ratio.  If  we  do  not  assume  any 
change  point,  i.e..  So  =  0,  Equation  (3)  is  reduced  to 


L(V-,h,S 0)  =  -N  -  N log 


N 


Ifi  n  tp  ,(„)) 


(4) 


Thus,  the  log-likelihood  ratio  of  the  case  where  we  assume  J  change  points  and  the  case 
where  we  assume  no  change  points  is  given  by 


LR(Sj)  =  L(D;  r  j+1,Sj)  -  L(D ;  ruS0) 

.  ,  N  w+l 

=  AHog 


1  1 

_  bin))  ~  ^  I®/!  l°g  ^  )b  -  bin)) 


n=  1 


7=1 


i„eDj 


(5) 


We  consider  the  problem  of  finding  the  set  of  change  points  Sj  that  maximizes  LR(S j) 
defined  by  Equation  (5). 

We  note  that,  in  general,  it  is  conceivable  that  we  are  not  able  to  acquire  the  complete 
tree  structure  of  the  diffusion  sequence  data.  Thus,  here,  we  consider  two  extreme  cases, 
one  in  which  the  information  spreads  fastest  (star  shape  tree)  and  the  other  in  which  the 
information  spread  slowest  (line  shape  tree).  The  function  which  defines  the  parent  node 
becomes  p(n )  =  0  for  the  former  and  p{n)  =  n  —  1  for  the  latter.  In  case  where  there 
is  no  change  point,  the  maximum  likelihood  estimator  is  r_l  =  (t\  +  •  •  •  +  t^)/N  -  to 
for  the  former  and  r  -  (t^  -  Iq)/N  for  the  latter.  While  we  conjecture  that  in  reality 
the  optimal  value  lies  in  between  these  two  extreme  values,  under  the  assumption  that 
the  actual  tree  structure  of  the  diffusion  data  is  unknown,  we  consider  to  approximate 
the  optimal  value  by  using  either  one  of  them.  Here,  note  that  in  the  former  case,  the 
maximum  likelihood  estimator  represents  the  average  diffusion  delay  time  between  the 
source  node  vo  and  each  node  v,-  which  is  assumed  to  be  connected  to  vo  by  a  direct 
link,  while  in  the  latter  case,  it  represents  the  average  time  interval  between  successive 
observation  time  points.  Considering  that  the  burst  period  we  want  to  detect  is  much 
shorter  than  the  other  non  burst  periods,  the  latter  case  (line  shape  tree)  seems  to  be 
more  suitable  for  our  aim.  Therefore,  LR(Sj)  defined  by  Equation  (5)  becomes 

Of(5,)  =  «log(^)-2lOj]og(E^il).  (6) 

We  compared  the  bursts  detected  by  using  the  two  extreme  values,  and  found  that  the 
use  of  line  shape  tree  gave  a  better  results  and  decided  to  use  Equation  (6)  in  our  exper¬ 
iments. 
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4  Change  Points  Detection  Method 

We  consider  the  problem  of  detecting  change  points  as  a  problem  of  finding  a  subset 
Sj  c  D  when  the  set  of  time  points  of  information  diffusion  result  1)  -  {to,  t\ ,  •  •  •  ,  tN ) 
and  the  number  of  change  points  J  are  given.  In  other  words,  we  search  for  J  time 
points  that  are  most  likely  to  be  the  change  points  from  a  sequence  of  N  observation 
points.  In  what  follows,  we  explain  each  of  the  three  methods,  naive  method  fan  ex¬ 
haustive  search),  simple  method  (a  greedy  search),  and  the  proposed  method  that  is  a 
combination  of  a  greedy  search  and  a  local  search. 

4.1  Naive  Method 

The  simplest  method  is  to  exhaustively  search  for  the  best  set  of  J  change  points  Sj. 
Clearly  the  time  complexity  of  this  naive  approach  is  0(NJ).  Thus,  the  number  of 
change  points  detectable  would  be  limited  to  J  -  2  in  order  for  the  solution  to  be 
obtained  in  a  reasonable  amount  of  computation  time  when  N  is  large  enough. 

4.2  Simple  Method 

We  describe  the  simple  method  which  is  applicable  when  the  number  of  change  points 
J  is  large.  This  is  a  progressive  binary  splitting  without  backtracking.  We  fix  the  already 
selected  set  of  (j  -  1)  change  points  Sj-  \  and  search  for  the  optimal  y'-th  change  point 
T j  and  add  it  to  Ss  \ .  We  repeat  this  procedure  from  j  =  1  to  J. 

The  algorithm  is  given  below. 

Stepl.  Initialize  j  =  1,  So  =  0. 

Step2.  Search  for  Tj  =  arg m-dxtri€£,{LR(Sj- \  U  {?„))). 

Step3.  Update  Sj  -  Sj- \  U  {Tj}. 

Step4.  If  j  -  J ,  output  Sj  and  stop. 

Step5.  j  —  j  +  1,  and  return  to  Step2. 

Here  note  that  in  Step3  elements  of  the  change  point  set  Sj  are  reindexed  to  satisfy 
Tj-\  <  Tj  for  i  =  2,  •  •  •  j.  Clearly,  the  time  complexity  of  the  simple  method  is  O(NJ) 
which  is  fast.  Thus,  it  is  possible  to  obtain  the  result  within  a  allowable  computation 
time  for  a  large  N.  However,  since  this  is  a  greedy  algorithm,  it  can  be  trapped  easily  to 
a  poor  local  optimal. 

4.3  Proposed  Method 

We  propose  a  method  which  is  computationally  almost  equivalent  to  the  simple  method 
but  gives  a  solution  of  much  better  quality.  We  start  with  the  solution  obtained  by  the 
simple  method  Sj ,  pick  up  a  change  point  Tj  from  the  already  selected  points,  fix  the 
rest  S j  \  {Tj}  and  search  for  the  better  value  T'  of  Tj,  where  ■  \  ■  represents  set  difference. 
We  repeat  this  from  j  =  1  to  J.  If  no  replacement  is  possible  for  all  j  (j  —  1,  •  •  •  J),  i.e. 
T'j  =  Tj  for  all  j,  no  better  solution  is  expected  and  the  iteration  stops. 

The  algorithm  is  given  below. 
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Stepl.  Find  5 j  by  the  simple  method  and  initialize  j  —  1 ,  £  =  0. 

Step2.  Search  for  T'.  =  argma x,iiEz){LR(Sj  \  {T j)  U  {r„})}. 

Step3.  If  T'j  —  Tj,  set  k  =  k  +  1,  otherwise  set  k  =  0,  and  update Sj  =  Sj\{Tj]  U  {71'}. 
Step4.  If  k  -  J,  output  Sj  and  stop. 

Step5.  If  j  =  ./,  set  j  =  1,  otherwise  set  j  =  j  +  1,  and  return  to  Step2. 

It  is  evident  that  the  proposed  method  requires  computation  time  several  times  larger 
than  that  of  the  simple  method,  but  it  is  much  less  than  that  of  the  naive  method.  How 
much  the  computation  time  increases  compared  to  the  simple  method  and  how  much  the 
solution  quality  increases  await  for  the  experimental  evaluation,  which  we  will  report 
in  Section  5. 


5  Experimental  Evaluation 


We  experimentally  evaluate  the  computation  time  and  the  accuracy  of  the  change  point 
detection  using  the  real  world  Twitter  information  diffusion  sequence  data  based  on 
the  methods  we  described  in  the  previous  section.  We,  then,  analyze  in  depth  the  top 
6  diffusion  sequences  in  terms  of  the  log-likelihood  ratio  based  on  the  detected  change 
points  and  burst  periods,  show  that  the  line  shape  tree  approximation  is  much  better  than 
the  star  shape  tree  approximation,  and  investigate  whether  or  not  we  are  able  to  identify 
which  node  in  a  social  network  caused  the  burst  from  the  detected  change  points. 


5.1  Experimental  Settings 

The  information  diffusion  data  we  used  for  evaluation  are  extracted  from  201,297,161 
tweets  of  1,088,040  Twitter  users  who  tweeted  at  least  200  times  during  the  three  weeks 
from  March  5  to  24,  2011  that  includes  March  11,  the  day  of  201 1  To-hoku  earthquake 
and  tsunami.  It  is  conceivable  to  use  a  retweet  sequence  in  which  a  user  sends  out  other 
user’s  tweet  without  any  modification.  But  there  exists  multiple  styles  of  retweeting 
(official  retweet  and  unofficial  retweet),  and  it  is  very  difficult  to  accurately  extract  a 
sequence  of  tweets  in  an  automatic  manner  considering  all  of  these  different  styles. 
Therefore,  in  our  experiments,  noting  that  each  retweet  includes  the  ID  of  the  user  who 
sent  out  the  original  tweet  in  the  form  of  “@ID”,  we  extracted  tweets  that  include  @ID 
format  of  each  user  ID  and  constructed  a  sequence  data  for  each  user.  More  precisely, 
we  used  information  diffusion  sequences  of  798  users  for  which  the  length  of  sequences 
are  more  than  5,000  (number  of  tweets).  Note  that  each  diffusion  sequence  includes 
retweet  sequences  on  multiple  topics.  Since  we  do  not  know  the  ground  truth  of  the 
change  points  for  each  sequence  if  there  are  changes  in  it,  we  used  the  naive  method 
which  exhaustively  search  for  all  the  possible  combinations  of  the  change  points  as 
giving  the  ground  truth.  We  had  to  limit  the  number  of  change  points  to  2  (J  -  2)  in 
order  for  the  naive  method  to  return  the  solution  in  a  reasonable  amount  of  computation 
time.  The  experimental  results  explained  in  the  next  subsection  is  obtained  by  using  a 
machine  with  Intel(R)  Xeon(R)  CPU  W5590  @3.33GHz  and  32GB  memory. 
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Fig.  1.  Comparison  of  computation  time  among  the  three  (naive,  simple,  and  proposed)  methods. 


5.2  Main  Results 

Performance  Evaluation  Figure  1  shows  the  computation  time  that  each  method 
needed  to  produce  the  results.  The  horizontal  axis  is  the  length  of  the  information  dif¬ 
fusion  data  sequences,  and  the  vertical  axis  is  the  computation  time  in  second.  The 
results  clearly  indicate  that  the  naive  method  requires  the  largest  computation  time. 
The  computation  time  is  quadratic  to  the  sequence  length  as  predicted.  In  contrast,  the 
computation  time  for  the  simple  and  the  proposed  methods  is  much  shorter  and  it  in¬ 
creases  almost  linearly  to  the  increase  of  the  sequence  length  for  both.  The  proposed 
method  requires  more  computation  time  due  to  the  extra  iteration  needed  for  delayed 
backtracking.  In  fact,  the  number  of  extra  iteration  is  2.2  on  the  average  and  7  at  most. 

Figure  2  shows  the  accuracy  of  the  detected  change  points.  We  regarded  that  the 
solution  obtained  by  the  naive  method  is  the  ground  truth.  The  horizontal  axis  is  the 
sequence  ranking  of  the  log-likelihood  ratio  for  the  naive  method  (ranked  from  the  top 
to  the  last),  and  the  vertical  axis  is  the  logarithm  of  the  likelihood  ratio  of  the  solution  of 
each  method.  The  results  indicate  that  the  simple  method  has  lower  likelihood  ratio  for 
all  the  range,  meaning  that  it  detects  change  points  which  are  different  from  the  optimal 
ones,  but  the  proposed  method  can  detect  the  correct  optimal  change  points  except  for 
the  low  ranked  sequences  for  which  the  likelihood  ratio  is  small  as  is  evident  from 
the  result  in  that  the  red  curve  representing  the  proposed  method  is  indistinguishable 
from  the  blue  curve  representing  the  naive  method.  The  reason  why  the  accuracy  of  the 
proposed  method  for  sequences  with  low  likelihood  decreases  may  be  because  the  burst 
period  is  not  clear  for  these  sequences.  In  summary,  out  of  the  798  sequences  in  total, 
the  proposed  method  gave  the  correct  results  for  713  sequences  (98.4%),  whereas  the 
simple  method  gave  the  correct  results  for  only  171  sequences  (21.4%).  The  average 
ratio  of  the  likelihood  ratio  of  the  proposed  method  to  that  of  the  naive  method  (optimal 
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Fig.  2.  Comparison  of  accuracy  among  the  three  (naive,  simple,  and  proposed)  methods. 


solution)  is  0.976,  whereas  the  corresponding  ratio  for  the  simple  method  is  0.881, 
revealing  that  the  proposed  method  gives  much  closer  ratio  to  the  optimal  likelihood 
ratio.  These  results  confirm  that  the  proposed  method  can  increase  the  change  point 
detection  accuracy  to  a  great  extent  compared  to  the  simple  method  with  only  a  small 
penalty  for  the  increased  computation  time. 


In  Depth  Analyses  on  Detected  Change  Points  and  Burst  Periods  Next,  we  had  a 
closer  look  at  the  top  6  diffusion  sequences  in  terms  of  the  log-likelihood  ratios.  Table  1 
shows  the  total  number  of  tweets  included  in  the  sequence,  the  starting  and  the  ending 
time  of  the  burst  period,  and  the  main  topics  that  appeared  near  the  beginning  of  the 
burst.  Figure  3  shows  how  the  cumulative  number  of  tweets  increases  as  time  goes 
for  each  diffusion  sequence.  The  horizontal  axis  is  time  and  the  vertical  axis  is  the 
cumulative  number  of  tweets.  The  two  red  vertical  lines  in  the  graph  are  the  change 
(starting  and  ending)  points  detected  by  the  proposed  method,  and  the  interval  between 
them  is  the  burst  period. 

As  is  understood  from  Table  1,  explosive  retweeting  of  the  information  of  urgent 
need  about  the  earthquake  for  a  short  period  of  time  triggered  the  start  of  the  burst 
(with  the  exception  of  the  4th  ranked  sequence).  The  4th  ranked  sequence  is  for  the 
account  called  “ordinary  timeline”  which  was  set  up  for  allowing  to  tweet  everyday 
topics  by  adding  “@itsumonoTL”  at  the  beginning  of  the  tweet  when  people  are  in 
voluntary  restraint  mood  after  the  disastrous  earthquake.  We  can  say,  with  the  exception 

1  NHK  is  the  government  operated  broadcaster. 

2  Great  Hanshin-Awaji  Earthquake  occurred  on  January  17,  1995  in  Kobe  area  and  6,434  people 
lost  their  lives. 
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Table  1.  Major  topics  appearing  at  the  beginning  of  the  burst  periods  of  the  top  6  diffusion  results 
in  terms  of  log-likelihood  ratio 


Ranking 

Length 

Detected  burst  period 

Major  topics  at  the  beginning  of  the  burst  period 

Start 

End 

1 

450,739 

2011/3/11 

14:48:13 

2011/3/13 

23:13:04 

Retweets  of  the  earthquake  bulletin  posted  by 
the  PR  department  of  Japan  Broadcasting  Cor¬ 
poration,  NHK  (@NHK_PR).1 

2 

27,372 

2011/3/11 

15:13:57 

2011/3/11 

16:19:26 

Retweets  of  the  article  on  to-do  list  at  the  time  of 
earthquake  onset  posted  by  a  victim  of  the  Great 
Hanshin-Awaji  Earthquake, 2 

3 

167,528 

2011/3/12 

00:18:19 

2011/3/14 

22:08:20 

Retweets  of  the  article  on  measures  against  cold 
at  an  evacuation  site  posted  by  the  news  depart¬ 
ment  of  NHK  (@nhk_seikatsu). 

4 

423,594 

2011/3/13 

18:38:50 

2011/3/19 

02:20:58 

Ordinary  tweets  irrelevant  to  the  earthquake 
posted  to  a  special  account  “@itsumonoTL”. 

5 

63,485 

2011/03/11 

15:05:08 

2011/03/12 

01:52:13 

Retweets  of  the  earthquake  bulletin  posted  by 
the  Fire  and  Disaster  Management  Agency 
( @  FDMA  JAPAN ) . 

6 

18,299 

2011/3/11 

15:45:17 

2011/3/11 

17:19:02 

Retweets  of  a  call  for  help  posted  by  a  user  who 
seemed  to  be  buried  under  a  server  rack  (later 
found  to  be  a  false  rumor). 

of  such  a  special  case  of  “ordinary  timeline”,  that  we  are  able  to  detect  efficiently  a  time 
period  where  tweets  on  a  specific  topic  (of  urgent  need  in  this  example)  are  intensively 
retweeted  by  looking  at  the  change  points  detected  by  the  proposed  method  even  from 
the  diffusion  sequence  that  contains  multiple  topics. 

We  note  that  the  cumulative  number  of  the  tweets  for  the  2nd  and  6th  ranked  diffu¬ 
sion  sequences  is  smaller  than  the  other  4  sequences  from  Table  1,  and  the  burst  period 
of  these  2  sequences  are  much  shorter  than  others  and  there  is  little  changes  in  the  num¬ 
ber  of  tweets  before  and  after  the  burst  from  Figure  3.  This  difference  is  considered  to 
come  from  whether  the  account  is  private  or  public.  Among  these  4  sequences,  except 
for  the  exceptional  4th  one,  the  remaining  3  are  all  from  the  public  organization  ac¬ 
counts  (1st  and  3rd  are  NHK  and  5th  is  FDMA).  Information  posted  by  these  accounts 
tends  to  disseminate  widely  everyday.  Thus,  considering  this  situation,  it  is  natural  to 
observe  that  the  cumulative  number  of  tweets  shows  a  relatively  smooth  increase  as  seen 
in  Figure  3  by  adding  multiple  bursts  of  short  periods  about  the  earthquake-related  in¬ 
formation  of  urgent  need  as  shown  in  Table  1.  Figure  3(e)  has  only  one  smooth  change 
during  the  burst  period,  which  indicates  that  the  earthquake  bulletin  in  Table  1  is  the 
only  source  of  the  burst.  On  the  other  hand,  we  see  multiple  smooth  changes  with  dis¬ 
continuity  of  the  gradient  at  each  boundary  during  the  burst  period  in  Figures  3(a)  and 
(c).  This  implies  that  there  can  be  other  sources  of  the  burst  than  shown  in  Table  1. 
Indeed,  it  is  possible  to  identify  these  change  points  by  increasing  the  value  of  /  (an 
example  explained  later).  On  the  other  hand.  Figures  3(b)  and  (f)  shows  that  the  infor¬ 
mation  posted  by  an  individual  that  is  rarely  retweeted  in  ordinary  situations  can  be 
propagated  explosively  if  it  is  of  urgent  need,  e.g.  timely  information  about  earthquake. 
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(e)  5th  ranked 


(f)  6th  ranked 


Fig.  3.  Temporal  change  of  cumulative  number  of  tweets  in  the  top  6  diffusion  results  in  terms  of 
the  highest  log-likelihood  ratio 


Here,  we  report  the  result  when  we  increase  the  number  of  change  points.  Figure  4 
shows  the  result  for  the  3rd  ranked  sequence  in  Figure  3(c)  when  J  is  set  to  9.  There 
are  9  vertical  lines  corresponding  to  each  change  point,  but  the  first  two  change  points 
are  too  close  and  indistinguishable.  Note  that  horizontal  axis  is  enlarged  and  the  range 
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xIO4 


Time  (unix  timestamp) 


Fig.  4.  Finer  burst  detection  for  the  3rd  ranked  sequence  in  Figure  4(c)  when  J  is  set  to  9 


shown  is  different  from  that  in  Figure  3(c).  We  see  that  the  detected  change  points  are 
located  at  the  boundary  points  where  the  gradients  of  the  curves  change  discontinuously. 
Those  4  broken  lines  in  green  are  considered  to  indicate  the  end  of  the  burst  because 
the  gradient  change  across  each  boundary  is  rather  smaller.  In  fact,  we  investigated  the 
most  recent  10  tweets  for  these  4  change  points  and  confirmed  that  no  more  than  half 
of  the  retweets  is  talking  about  the  same  topic  except  the  one  second  from  the  last  in 
which  7  of  them  are  on  the  same  topic.  The  remaining  5  change  points  (red  lines)  all 
contain  at  least  7  retweets  (10,  8,  7,  7,  9)  that  are  on  the  same  topic.  From  this  fact,  we 
can  reconfirm  that  there  appear  many  tweets  on  the  same  topic  during  the  burst  period. 


Line  Shape  Tree  vs.  Star  Shape  Tree  Note  that  all  of  these  results  were  obtained  by 
assuming  that  the  information  diffuses  along  the  line  shape  tree  as  discussed  in  Sec¬ 
tion  3.  Here,  we  show  that  use  of  line  shape  tree  gives  better  results  than  use  of  star 
shaped  tree.  To  this  end,  we  compared  the  bursts  detected  for  the  2nd  and  6th  ranked 
information  diffusion  sequences  which  include  only  one  burst. 

The  results  are  illustrated  in  Figure  5,  where  red  solid  and  green  broken  vertical 
lines  denote  the  change  points  detected  by  the  naive  method  with  the  line  shape  and  star 
shape  settings,  respectively.  Only  the  time  range  of  interest  is  extracted  and  shown  in 
the  horizontal  axis.  From  these  figures,  we  observe  that  use  of  line  shape  tree  detects 
the  change  points  more  precisely  as  expected,  which  means  that  line  shape  tree  gives  a 
better  approximation  of  the  maximum  likelihood  estimator  than  star  shape  tree  even  if 
the  actual  tree  shape  of  the  diffusion  path  is  not  known  to  us. 
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Time  (unix  timestamp) 


(a)  2nd  ranked 


xIO4 


(b)  6th  ranked 


Fig.  5.  Comparison  of  bursts  detected  by  use  of  line  shape  tree  and  star  shape  tree  for  the  2nd  and 
6th  ranked  information  diffusion  sequences  in  Table  1 . 


Change  Points  in  a  Time  Line  and  Nodes  in  a  Network  Remember  that  each  ob¬ 
served  time  point  corresponds  to  a  node  in  a  social  network.  In  this  sense,  it  can  be 
said  that  the  proposed  method  detects  not  only  the  change  points  in  a  time  line,  but  also 
the  change  points  in  a  network.  However,  unfortunately,  those  nodes  do  not  necessarily 
correspond  to  those  which  actually  caused  the  burst  period.  For  example,  in  the  second 
ranked  sequence  in  Table  1,  we  observed  at  least  1  retweet  of  the  article  described  in 
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Table  1  per  second  after  the  start  of  the  burst,  2011/3/11  15:13:57,  while  we  observed 
at  most  20  per  minute  before  the  burst  started.  This  shows  the  accuracy  of  the  detected 
change  point,  but  it  also  means  that  the  node  that  actually  influenced  nodes  within  the 
burst  period  could  exist  in  the  period  before  the  change  point.  Indeed,  we  observed  the 
first  retweet  at  2011/3/1 1  15:07:05  and  69  retweets  thereafter  before  the  change  point.  It 
is  natural  to  think  that  some  of  them  played  an  important  role  on  the  explosive  diffusion 
of  the  article.  We  need  to  know  the  actual  information  diffusion  path  to  find  such  im¬ 
portant  nodes,  but  detecting  change  points  in  a  time  line  would  significantly  reduce  the 
effort  needed  to  do  so  because  the  search  can  be  focused  on  the  limited  sub-sequences 
around  the  change  points.  Devising  a  method  to  find  such  important  nodes  is  one  of  our 
future  work. 


6  Conclusion 


We  addressed  the  problem  of  detecting  the  period  in  which  information  diffusion  burst 
occurs  from  a  single  observed  diffusion  sequence  under  the  assumption  that  the  delay 
of  the  information  propagation  over  a  social  network  follows  the  exponential  distribu¬ 
tion.  To  be  more  precise,  we  formulated  the  problem  of  detecting  the  change  points  and 
finding  the  values  of  the  time  delay  parameter  in  the  exponential  distribution  as  an  op¬ 
timization  problem  of  maximizing  the  likelihood  of  generating  the  observed  diffusion 
sequence.  We  devised  an  efficient  iterative  search  algorithm  for  the  change  point  detec¬ 
tion  whose  time  complexity  is  almost  linear  to  the  number  of  data  points.  We  tested  the 
algorithm  against  the  real  Twitter  data  of  the  2011  To-hoku  earthquake  and  tsunami, 
and  experimentally  confirmed  that  the  algorithm  is  much  more  efficient  than  the  ex¬ 
haustive  naive  search  and  is  much  more  accurate  than  the  simple  greedy  search.  By 
analyzing  the  real  information  diffusion  data,  we  revealed  that  even  if  the  data  contains 
tweets  talking  about  plural  topics,  the  detected  burst  period  tends  to  contain  tweets  on 
a  specific  topic  intensively.  In  addition,  we  experimentally  confirmed  that  assuming  the 
information  diffusion  path  to  be  the  line  shape  tree  results  in  much  better  approximation 
of  the  maximum  likelihood  estimator  than  assuming  it  to  be  the  star  shape  tree.  This  is  a 
good  heuristic  to  accurately  estimate  the  change  points  when  the  actual  diffusion  path  is 
not  known  to  us.  These  results  indicate  that  it  is  possible  to  detect  and  identify  both  the 
burst  period  and  the  topic  diffused  without  extracting  the  tweet  sequence  for  each  topic 
and  identifying  the  diffusion  paths  for  each  sequence,  and  the  proposed  method  can  be 
a  useful  tool  to  analyze  a  huge  amount  of  information  diffusion  data.  Our  immediate 
future  work  is  to  compare  the  proposed  method  with  existing  burst  detection  methods 
that  are  designed  for  data  stream.  We  also  plan  to  devise  a  method  of  finding  nodes  that 
caused  the  bust  based  on  the  change  points  detected. 
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ABSTRACT 

We  address  the  problem  of  visualizing  structure  of  undi¬ 
rected  graphs  that  have  a  value  associated  with  each  node 
into  a  K-dimensional  Euclidean  space  in  such  a  way  that  1) 
the  length  of  the  point  vector  in  this  space  is  equal  to  the 
value  assigned  to  the  node  and  2)  nodes  that  are  connected 
are  placed  as  close  as  possible  to  each  other  in  the  space 
and  nodes  not  connected  are  placed  as  far  apart  as  possible 
from  each  other.  The  problem  is  reduced  to  K-dimensional 
spherical  embedding  with  a  proper  objective  function.  The 
existing  spherical  embedding  method  can  handle  only  a  bi¬ 
partite  graph  and  cannot  be  used  for  this  purpose.  The  other 
graph  embedding  methods,  e.g.,  multi-dimensional  scaling, 
spring  force  embedding  methods,  etc.,  cannot  handle  the 
value  constraint  and  thus  are  not  applicable,  either.  We 
propose  a  very  efficient  algorithm  based  on  a  power  iteration 
that  employs  the  double-centering  operations.  We  apply  the 
method  to  visualize  the  information  diffusion  process  over  a 
social  network  by  assigning  the  node  activation  time  to  the 
node  value,  and  compare  the  results  with  the  other  visu¬ 
alization  methods.  The  results  applied  to  four  real  world 
networks  indicate  that  the  proposed  method  can  visualize 
the  diffusion  dynamics  which  the  other  methods  cannot  and 
the  role  of  important  nodes,  e.g.  mediator,  more  naturally 
than  the  other  methods. 

Categories  and  Subject  Descriptors 

1.2.6  [Learning]:  Parameter  learning 

Keywords 

graph  embedding,  visualization,  information  diffusion 

1.  INTRODUCTION 

Complex  network  is  hard  to  understand.  Visualization 
can  help,  but  in  reality  it  is  not  self-evident  whether  there 
exists  a  good  general  visualization  scheme  that  satisfies  most 
of  our  needs.  Especially  if  we  want  to  visualize  the  dynamics 
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taking  place  over  a  network,  the  only  solution  seems  to  use 
animation  over  time,  which  is  not  what  we  are  aiming  at. 

We  consider  the  following  problem:  Visualize  the  struc¬ 
ture  of  an  undirected  graph  that  has  a  value  assigned  to 
each  node  in  a  K-dimensional  Euclidean  space  in  such  a  way 
that  1)  the  length  of  the  point  vector  in  this  space  is  equal  to 
the  node  value  and  2)  nodes  that  are  connected  are  placed 
as  close  as  possible  to  each  other  in  the  space  and  nodes  not 
connected  are  placed  as  far  apart  as  possible  from  each  other. 
The  constraint  1)  is  unique  to  this  method  and  brings  more 
flexibility  in  visualization.  In  fact,  this  enables  to  visualize 
a  dynamics  mentioned  in  the  beginning. 

The  need  for  visualization  is  so  high  that  various  graph 
embedding  methods  have  already  been  proposed  and  are 
widely  used.  These  include  multi-dimensional  scaling  [15], 
spectral  embedding  [2] ,  spring  force  embedding  [4]  and  cross¬ 
entropy  embedding  [16].  All  of  them  are  applicable  to  undi¬ 
rected  graphs.  Spherical  embedding  [11,  3]  that  came  a  little 
later  is  designed  to  visualize  bipartite  graphs.  Among  these 
five,  the  first  four  cannot  handle  the  constraint  1).  The  last 
one  cannot  apply  to  a  general  undirected  graph.  To  our 
knowledge,  there  is  no  method  that  can  directly  handle  our 
problem.  Further,  apart  from  the  above  problem,  those  that 
solve  non-linear  optimization  problem  by  a  power  iteration, 
except  [3],  are  extremely  slow. 

We  show  that  the  above  visualization  problem  is  reduced 
to  spherical  embedding  that  is  formulated  as  a  non-linear 
optimization  problem  which  maximizes  a  certain  objective 
function  that  involves  an  operation  called  “double-centering”. 
The  problem  can  be  solved  by  a  simple  power  iteration  as  is 
done  in  the  above  existing  methods,  but  this  is  very  ineffi¬ 
cient.  We  propose  a  much  more  efficient  algorithm  making 
effective  use  of  the  sparsity  of  the  adjacency  matrix,  which 
is  true  for  most  complex  networks.  We  verify  that  the  algo¬ 
rithm  works  as  intended  by  applying  it  to  the  visualization  of 
information  diffusion  process  over  a  large  social  network  by 
assigning  the  node  activation  time  to  the  node  value  (detail 
in  Section  4.2).  The  results  obtained  by  four  real  world  so¬ 
cial  networks  confirm  our  conjecture.  Time  evolution  of  the 
diffusion  process  is  easily  visualized  by  the  proposed  method 
and  in  this  process  such  nodes  that  have  a  role  of  mediating 
the  diffusion  are  more  easily  identifiable  than  the  other  ex¬ 
isting  methods  which  cannot  handle  the  diffusion  dynamics. 

This  paper  is  organized  as  follows.  We  first  describe  the 


problem  framework  of  embedding  undirected  graphs  into 
a  low  dimensional  Euclidean  space  (2.1.),  show  a  simple 
update  method  for  solving  the  optimal  solution  (2.1),  fol¬ 
lowed  by  the  proposed  efficient  update  method  (2.3).  Next 
we  briefly  compare  the  proposed  method  with  four  existing 
methods  (3).  We  then  explain  how  we  apply  the  method  to 
the  visualization  of  information  diffusion  (4),  and  report  the 
results  (5).  We  conclude  the  paper  by  summarizing  what 
has  been  achieved  (7). 


with  respect  to  the  matrix  X  constructed  from  the  position 
vectors. 
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2.  SPHERICAL  GRAPH  EMBEDDING 

We  describe  the  framework  of  embedding  an  undirectded 
graph  G  =  (V,  E )  without  self-loops  into  a  A'-dimensional 
Euclidean  space,  where  V  and  E  (c  V  x  V)  stand  for  the 
sets  of  all  the  nodes  and  links,  respectively.  For  the  sake  of 
technical  convenience,  we  identify  the  set  of  the  nodes,  V, 
by  a  series  of  positive  integers,  i.e.,  V  =  (1,  ■  •  •  ,  m,  ■  ■  ■  ,  Mj. 
Here  M  is  the  number  of  the  nodes  in  V  ,  i.e.,  \V\  =  M. 
Then,  we  can  define  the  M  x  M  adjacency  matrix  A  = 
{ am,n }  by  setting  am,n  =  1  if  {m,  n}  e  E-  am,n  =  0  oth¬ 
erwise.  Note  taht  am,„  =  an,m  and  am,m  =  0.  We  denote 
the  A'-dimensional  embedding  position  vectors  by  xm  for 
the  node  m  £  V.  Then  we  can  construct  the  K  x  M  matrix 
consisting  of  these  position  vectors,  i.e.,  X  =  (xj,  •  •  •  ,xm). 

2.1  Problem  Formulation 

We  first  state  the  framework  of  our  embedding  problem 
intuitively:  For  a  given  undirectded  graph  G  =  ( V ,  E)  and 
a  set  of  values  assigned  to  each  node,  denoted  by  (r i,  •  •  • , 
rm,  •  •  • ,  rM),  we  attempt  to  visualize  the  graph  so  that  each 
pair  of  nodes  with  similar  connection  patterns  is  embedded 
as  a  pair  of  position  vectors  with  similar  directions,  and  each 
length  of  the  embedded  position  vectors  is  set  to  the  above 
value  assigned  to  the  node,  i.e.,  ||xm||  =  rm  for  each  m, 
where  ||xm||  stands  for  the  norm  of  the  vector  xm. 

In  order  to  more  closely  explain  our  embedding  problem, 
we  introduce  the  centering  (Young-Householder  transforma¬ 
tion)  matrix, 

Hm  =  J-m  —  — ImIm,  (1) 

where  I m  stands  for  the  M  x  M  identity  matrix,  1  m  is  an 
M-dimensional  vector  whose  elements  are  all  one,  and  1T 
means  the  transposition  of  the  vector  1.  Clearly,  the  mean 
vector  of  the  resulting  position  vectors  becomes  0  by  the 
operations  XH m-  Then,  we  consider  the  following  double- 
centered  matrix  B  =  {bm,n}  that  is  calculated  from  the 
adjacency  matrix  A. 


where  (Am  |  m  =  1,  •  •  •  M}  correspond  to  Lagrange  multi¬ 
pliers  for  the  constraints,  i.e.,  x^(xm  =  for  1  <  m  <  M. 
Intuitively,  maximizing  J(X)  pushes  the  pairs  xm  and  xn  to 
the  same  direction  if  they  are  connected  and  pushes  them  to 
the  opposite  direction  if  they  are  unconnected,  and  realizes 
the  intended  visualization. 

Now,  we  consider  a  reparameterization  of  each  position 
vector  xm  by  xm  =  xm/rm,  and  set  X  =  (xi,---  ,xm)T. 
Then,  we  can  equivalently  transform  our  objective  function 
defined  in  Equation  (4)  as  follows, 

M- 1  M  i  M 

<7(X)  =  ^  ^  )  bm,nX-mX.ri  T  ^  /*  '  Mm  ( 1  XmXm),  (5) 

m= 1  n=m+l  m= 1 

where  Mm  =  A m/r^,  for  each  m.  Thus,  maximizing  Equa¬ 
tion  (4)  is  implemented  by  the  following  two  steps:  First,  we 
calculate  the  position  vector  xm  for  each  node  on  the  unit 
sphere  (circle),  so  as  to  maximize  Equation  (5);  Then,  we 
can  obtain  the  final  position  vectors  just  by  rescaling  them 
with  respect  to  (r  1,  •  •  •  ,tm),  i.e.,  xm  =  rmxm  for  each  m. 
Thus  we  can  regard  our  problem  as  a  shperical  graph  em¬ 
bedding  problem  on  the  unit  sphere.  Hereafter,  we  simply 
denote  xm  as  x„,  in  order  to  avoid  notational  complication. 
Here  we  should  emphasize  that  in  our  problem  formaliza¬ 
tion,  the  directions  of  the  embedded  position  vectors  are 
determined  independently  from  the  values  assigned  to  each 
node. 


2.2  Simple  Update  Method 

Now  we  consider  maximizing  J(X)  defined  in  Equation  (5) 
by  use  of  a  coordinate  strategy:  We  maximize  J(X)  with  re¬ 
spect  to  each  position  vector  xm,  by  fixing  the  other  position 
vectors.  In  order  to  optimally  update  each  position  vector 
xm,  we  consider  the  following  gradient  vector  of  the  objec¬ 
tive  function  J(X)  with  respect  to  xm. 


<9J(X) 

<9xm 


E 


n=l,n^m 


(6) 


B  =  HmAHm. 


(2) 


Note  that  the  mean  vectors  of  both  the  row  and  the  column 
vectors  of  the  matrix  B  become  0.  On  the  other  hand,  for 
position  vectors  {xi,  •  •  •  ,xm},  we  can  consider  the  similar¬ 
ity  matrix  C  =  {cm,n},  each  element  of  which  is  defined  by 
the  following  cosine  similarity. 


c 


m,n  — 


(3) 


As  the  basic  strategy  of  our  graph  embedding,  we  maxi¬ 
mize  the  correlation  between  the  the  double-centered  matrix 
B  and  the  cosine  similarity  matrix  C  by  adequately  locat¬ 
ing  each  position  vector  under  the  constraints  ||xm||  =  rm. 
Namely,  we  can  consider  the  following  objective  function 


Thus,  for  the  fixed  vectors  {xi,  •  •  ■  xm}  \  x„, ,  we  obtain  the 
optimal  position  vector  xm  which  maximizes  the  objective 
function  J(X)  as  follow: 


x 


m  — 


(7) 


where 


M 

fm  =  ^  ^  bm.nX-n  =  (X  xmem)Bem.  (8) 

n=l,n^m 


Here  em  is  an  M-dimensional  unit  vector  whose  m-th  ele¬ 
ment  is  1,  and  the  other  elements  are  0. 

However,  this  simple  iteration  method  requires  the  com¬ 
putational  complexity  of  O(MK)  for  updating  each  optimal 


position  vector  according  to  Equation  (8).  In  order  to  make 
better  use  of  the  sparsity  of  adjacency  matrix  which  is  fre¬ 
quently  observed  in  most  complex  networks,  we  derive  an 
efficient  way  of  calculating  Equation  (8)  in  the  succeeding 
subsection. 


2.3  Efficient  Update  Method 

We  first  focus  on  the  following  equivalent  formula  for  cal¬ 
culating  fm  in  Equation  (8). 

f m  —  X^Be,„  (emBem)xm. 

Here  we  consider  a  degree  vector  defined  by 


d  =  (di, 

and  their  average, 


,  dju)T  =  Aljt 


(9) 

(10) 


(ii) 


Then,  from  the  definition  of  double-centered  matrix  B  given 
in  Equation  (2),  we  can  calculate  Bem  as  follows. 


Be,, 


=  (Im  —  — 1a/1m)A(Im  —  Yj  lMl«)e“ 


D  -  dm  1 

-  Aem  +  ^— iM-jjd. 


(12) 


By  noting  that  e^Aem  =  0  because  of  no  self- loops,  we 
obtain  elBem  as  follows. 


eT  Be  -  — _ — 


(13) 


Now  we  define  the  average  position  vector  <p  and  the 
degree-weighted  average  position  vector  ip  by 


=  V>  =  ^xd, 


(14) 


respectively.  Then  by  substituting  Equations  (12)  and  (13) 
into  Equation  (9),  we  can  obtain  the  following. 

+  (D  -  dm)4>  -  Ip  -  D  J^dm  Xm,  (15) 


f m  —  ^  ^  X, 

nCr(m) 


where,  T(m)  denotes  a  set  of  neighbour  nodes  of  v,  i.e.,  those 
nodes  that  are  connected  to  v.  Thus  by  noting  that  both  <p 
and  ip  are  A'-dimensional  vectors,  and  the  average  number  of 
elements  in  T(m)  is  D,  i.e.,  D  =<  |r(m)|  >,  we  can  see  that 
the  average  computational  complexity  of  calculating  fm  is 
reduced  to  O(DK)  from  O(MK)  in  average.  As  mentioned 
earlier,  we  can  naturally  assume  M  D  for  a  wide  variety 
of  complex  networks. 

On  the  other  hand,  after  updating  the  position  vector  xm, 
we  need  to  update  vectors  <p  and  ip  according  to  this  change 
as  well.  For  this  purpose,  after  setting  the  updated  vector 
y m  and  the  modification  vector  Ax^  by, 

y m  =  TJ7  77  fm ,  Axm  =  y m  ~  Xm,  (16) 


we  update  the  vectors  </>,  t/?,  and  xm  as  follows. 


,  =  <^+mAx™’ 


1p=1p+  jjAXm, 


(17) 


Clearly,  these  updates  can  be  done  within  the  computational 
complexity  of  O(K).  Thus,  we  can  see  that  the  computa¬ 
tional  complexity  of  updating  x„,  is  equal  to  O(DK). 

Below  we  summarize  our  spherical  embedding  algorithm 
proposed  in  this  paper. 


1.  Initialize  position  vectors  {xi,  •  •  •  ,xm)  adequately;  and 

calculate  vectors  <p  and  ip  by  Equation  (14); 

2.  For  each  m  £  {1,  •  •  •  ,  M},  calculate  fm  by  Equation  (15), 

set  vectors  ym  and  A xm  by  Equation  (16),  and  update 
vectors  <p,  ip,  and  xm  by  Equation  (17); 

3.  If  rnaxm{||<9J(X)/<9xm||}  <  e,  output  {xi,  ■  ■  ■  , xm}  and 

terminate; 

4.  Return  to  the  step  2. 

Our  proposed  algorithm  employs  a  power  iteration  as  the 
basic  framework,  just  like  the  HITS  algorithm  [8],  which 
utilizes  A  and  Ar,  does.  However,  the  main  differences 
are  use  of  the  double-centering  operation  by  Hm,  and  the 
constraints  described  by  ||x,„||  =  rm.  Here  note  that  the 
double-centering  operation  is  also  employed  in  the  standard 
multidimensional  scaling  method  [15]. 

Now  we  briefly  mention  the  computational  complexity  of 
our  algorithm.  Clearly,  the  main  computational  complexity 
of  one-iteration  comes  from  the  multiplication  by  the  ma¬ 
trix  A  with  position  vectors  xm,  which  is  the  most  compu¬ 
tationally  intensive  part  and  is  proportional  to  the  number 
of  links  in  the  undirected  graph.  Thus,  the  proposed  algo¬ 
rithm  is  expected  to  work  much  faster  especially  for  a  sparse 
undirected  graph.  In  fact,  it  has  been  well  known  that  the 
PageRank  algorithm  [1]  based  on  a  power  iteration  works 
very  fast  for  a  large  and  sparse  network  [10]  even  without 
parallel  distributed  processing. 

3.  ALGORITHMIC  COMPARISON  WITH 
CONVENTIONAL  METHODS 

We  compare  the  proposed  method  from  algorithmic  as¬ 
pect  with  the  four  well  known  embedding  methods:  multi¬ 
dimensional  scaling  [15],  spectral  embedding  [2],  spring  force 
embedding  [4],  and  cross-entropy  embedding  [16].  Here  the 
former  two  perform  a  power  iteration  with  respect  to  either 
a  double-centered  distance  matrix  or  a  graph  Laplacian  ma¬ 
trix  which  is  calculated  from  a  given  graph,  while  the  latter 
two  repeatedly  move  each  position  vector  by  using  the  New¬ 
ton  method  in  a  framework  of  nonlinear  optimization.  Here 
note  that  the  basic  strategy  of  our  method  is  a  combina¬ 
tion  of  the  above  basic  strategy,  i.e.,  our  method  performs 
a  power  iteration  with  respect  to  a  double-centered  adja¬ 
cency  matrix  while  repeatedly  moving  each  position  vector. 
However,  recall  that  these  existing  methods  cannot  directly 
utilize  the  values  associated  with  nodes.  In  what  follows, 
we  compare  our  method  more  closely  with  these  existing 
methods. 

Multi-dimensional  scaling  method  [15]  first  calculates  the 
distance  matrix  G,  and  performs  the  double  centering  opera¬ 
tion  to  the  distance  matrix.  Mathematically  it  is  formulated 
as  minimizing  Equation  (18). 

1  K 

M(X)  =  -^zI(H„GHmK  (18) 

Z  k  =  1 

where  Zfe  =  (*i,fc,  •  •  •  ,XM,k)T ,  and  (zi,  •  •  •  ,  z k}  need  to  be 
orthonormal  vectors,  i.e.,  zjTzfc  =  1  and  z^zk>  =  0  if  k  ^  k' . 

Spectral  embedding  method  [2]  tries  to  directly  minimize 
distances  between  position  vectors  of  connecting  nodes.  Math- 


ematically  it  is  formulated  as  minimizing  Equation  (19). 


S(X) 


K  M  N 


bEEE  &m,n  ( Zk  ,m  Zk,n) 


k= 1  m=  1  n= 1 


^zfe(D  -  A)z k, 


(19) 


where  D  is  a  diagonal  matrix  each  element  of  which  is  the 
degree  of  node  (number  of  links).  Note  that  (D  —  A)  is 
referred  to  as  a  graph  Laplacian  matrix.  Again,  we  set  Zfc  = 
(sci,k,  •  •  •  ,  XM,k)T ,  and  {zi,  •  •  •  ,  z k}  need  to  be  orthonormal 
vectors,  which  excludes  the  trivial  vector  expressed  as  z  oc 
1m- 

Spring  force  embedding  method  [4]  assumes  that  there  is  a 
hypothetical  spring  between  each  connected  node  pair  and 
locates  nodes  such  that  the  distance  of  each  node  pair  is 
closest  to  its  minimum  path  length  at  equilibrium.  Mathe¬ 
matically  it  is  formulated  as  minimizing  Equation  (20). 


M  —  l  M 

/C(X)  =  22  E  am,n(gm,n  -  ||xm  -  Xn  ||)2,  (20) 

m= 1  n=m-\- 1 


where  am,n  is  a  spring  constant  which  is  normally  set  to 

1/(2  gl,v). 

Cross-entropy  embedding  method  [16]  first  defines  a  sim¬ 
ilarity  p(xm,x„)  between  the  embedding  positions  xm  and 
xn  and  uses  the  corresponding  element  am,n  of  the  adja¬ 
cency  matrix  as  a  measure  of  distance  between  the  node 
pair,  and  tries  to  minimize  the  total  cross  entropy  between 
these  two.  Mathematically  it  is  formulated  as  minimizing 
Equation  (21). 

M- 1  M 

w  =  -E  E  (cLm,n  log  p(x  m  •)  Xn) 

m= 1  n=m-\- 1 

+  (1  -  Om,n)  log(l  -  p(x„,X„)))  .  (21) 

Here,  note  that  we  used  the  function  p(xu,  x„)  =  exp(—  1  ||xu  — 
x„||2)  in  our  experiments. 

The  spectral  embedding  method  is  expected  to  work  com¬ 
parable  to  our  method  because  these  methods  perform  a 
power  iteration  on  a  sparse  adjacency  matrix.  The  multi¬ 
dimensional  scaling  method  requires  a  substantially  large 
computation  time  because  it  needs  to  perform  a  power  it¬ 
eration  on  a  full  distance  matrix.  Spring  force  embedding 
method  and  Cross-entropy  embedding  method  both  of  which 
repeatedly  move  each  position  vector  by  using  the  Newton 
method,  require  an  extremely  large  computation  time  before 
the  final  results  are  obtained. 


4.  APPLICATION  TO  VISUALIZATION  OF 
INFORMATION  DIFFUSION  DATA 

Our  primary  application  of  the  proposed  method  is  visual¬ 
ization  of  information  diffusion  process  over  a  social  network. 
We  start  with  a  brief  description  of  the  diffusion  models  we 
used  and  then  describe  how  we  visualize  the  diffusion  data. 

4.1  Information  diffusion  models 

We  focus  on  the  IC  (Independent  Cascade)  and  the  LT 
(Linear  Threshold)  models  [5]  as  the  representative  mod¬ 
els  of  information  diffusion,  and  utilize  their  extended  ver¬ 
sion  that  can  cope  with  asynchronous  time  activation,  AsIC 


(Asynchronous  IC)  and  AsLT  (Asynchronous  LT)  models 
[13,  14]  in  our  experiments. 

4.1.1  Asynchronous  Independent  Cascade  Model 

We  first  recall  the  definition  of  the  IC  model  according  to 

the  work  of  [5],  and  then  introduce  the  AsIC  model.  In  the 
IC  model,  we  specify  a  real  value  Pm,n  with  0  <  pm,n  <  1 
for  each  link  (m,  n )  in  advance.  Here  pm,n  is  referred  to  as 
the  diffusion  probability  through  link  (m,  n).  The  diffusion 
process  unfolds  in  discrete  time-steps  t  >  0,  and  proceeds 
from  a  given  initial  active  set  S  in  the  following  way.  When 
a  node  m  becomes  active  at  time-step  t,  it  is  given  a  sin¬ 
gle  chance  to  activate  each  currently  inactive  child  node  n, 
and  succeeds  with  probability  Pm,n-  If  m  succeeds,  then  n 
will  become  active  at  time-step  t  +  1.  If  multiple  parent 
nodes  of  n  become  active  at  time-step  t,  then  their  acti¬ 
vation  attempts  are  sequenced  in  an  arbitrary  order,  but 
all  performed  at  time-step  t.  Whether  or  not  m  succeeds, 
it  cannot  make  any  further  attempts  to  activate  n  in  subse¬ 
quent  rounds.  The  process  terminates  if  no  more  activations 
are  possible. 

In  the  AsIC  model,  we  specify  real  values  rm,n  with  rm,n  > 
0  in  advance  for  each  link  ( m,n )  €  E  in  addition  to  Pm,n, 
where  rm,„  is  referred  to  as  the  time-delay  parameter  through 
link  (m,  n).  The  diffusion  process  unfolds  in  continuous-time 
t,  and  proceeds  from  a  given  initial  active  set  S  in  the  follow¬ 
ing  way.  Suppose  that  a  node  m  becomes  active  at  time  t. 
Then,  m  is  given  a  single  chance  to  activate  each  currently 
inactive  child  node  n.  We  choose  a  delay-time  8  from  the 
exponential  distribution1  with  parameter  rmi„.  If  n  has  not 
been  activated  before  time  t+5,  then  m  attempts  to  activate 
n,  and  succeeds  with  probability  Pm,n-  If  m  succeeds,  then  n 
will  become  active  at  time  t  +  5.  Said  differently,  whichever 
parent  m  that  succeeds  in  satisfying  the  activation  condition 
and  for  which  the  activation  time  is  the  earliest  considering 
the  time  delay  associated  with  each  link  can  actually  acti¬ 
vate  the  node.  Under  the  continuous  time  framework,  it  is 
unlikely  that  n  is  activated  simultaneously  by  its  multiple 
parent  nodes  exactly  at  time  t  +  8.  So  we  ignore  this  pos¬ 
sibility.  Whether  or  not  m  succeeds,  it  cannot  make  any 
further  attempts  to  activate  n  in  subsequent  rounds.  The 
process  terminates  if  no  more  activations  are  possible. 

4.1.2  Asynchronous  Linear  Threshold  Model 

Same  as  the  above,  we  first  recall  the  LT  model.  In  this 
model,  for  every  node  n  £  V,  we  specify  a  weight  ( qm,n  >  0) 
from  its  parent  node  m  in  advance  such  that 

''y  ]  Qm,n  +  I* 

The  diffusion  process  from  a  given  initial  active  set  S  pro¬ 
ceeds  according  to  the  following  randomized  rule.  First,  for 
any  node  n  £  V,  a  threshold  6n  is  chosen  uniformly  at  ran¬ 
dom  from  the  interval  [0, 1].  At  time-step  /,,  an  inactive  node 
n  is  influenced  by  each  of  its  active  parent  nodes,  m,  accord¬ 
ing  to  weight  qm,n.  If  the  total  weight  from  active  parent 
nodes  of  n  is  no  less  than  9n ,  that  is, 

'y  ]  9m,n  A  9r i, 

m£Bt  (n) 

1  Similar  formulation  can  be  derived  for  other  distributions 
such  as  power-law  and  Weibull. 


then  n  will  become  active  at  time-step  t  +  1.  Here,  Bt(n) 
stands  for  the  set  of  all  the  parent  nodes  of  n  that  are  active 
at  time-step  t.  The  process  terminates  if  no  more  activations 
are  possible. 

The  AsLT  model  is  defined  in  a  similar  way  to  the  AsIC. 

In  the  AsLT  model,  in  addition  to  the  weight  set  {gm,n}, 
we  specify  real  values  rm,n  with  rm,n  >  0  in  advance  for 
each  link  ( m,n ).  Same  as  for  AsIC,  we  refer  to  rm>n  as 
the  time-delay  parameter  through  link  (m,  n ).  The  diffusion 
process  unfolds  in  continuous-time  t,  and  proceeds  from  a 
given  initial  active  set  S  in  the  following  way.  Each  active 
parent  m  of  the  node  n  exerts  its  effect  on  n  with  the  time 
delay  5  drawn  from  the  exponential  distribution  with  the 
delay  parameter  rm,„ .  Suppose  that  the  accumulated  weight 
from  the  active  parents  of  node  n  has  become  no  less  than  9n 
at  time  t  for  the  first  time.  Then,  the  node  n  becomes  active 
at  t  without  any  delay  and  exerts  its  effect  on  its  child  with 
a  delay  associated  with  its  link.  This  process  is  repeated 
until  no  more  activations  are  possible. 

4.2  Visualization  Method 

Let  R  =  {{m,  tm),  (n,  f„),  •  •  ■  }  be  an  information  diffusion 
result  over  an  undirected  G  =  (V,E),  where  (n,  tn)  is  a  pair 
of  an  activated  node  and  its  activation  time.  We  set  the  ini¬ 
tial  activation  time  to  0.  From  the  set  of  nodes  that  appear 
in  R,  i.e.,  V'  =  {n  \  ( n,t„ )  £  R},  we  obtain  an  induced 
subgraph  G'  =  (' V',E Here,  we  regard  tn  as  n’s  associ¬ 
ated  value  for  n  £  V' .  If  m  £  V' ,  n  £  V' ,  (m,  n )  £  E' ,  and 
tm  <  tn,  the  direction  of  information  diffusion  is  limited  to 
from  node  m  to  n.  Namely,  a  directed  acyclic  graph  (DAG) 
is  constructed  from  the  information  diffusion  result  R.  Al¬ 
though  our  embedding  method  is  designed  for  undirected 
graph,  we  can  interpret  that  the  diffusion  of  information 
takes  over  from  the  origin  to  the  periphery  by  setting  the 
radius  of  node  n  to  tn ■  The  major  reason  why  we  restricted 
the  graph  we  handle  to  undirected  graph  is  to  maintain  clear 
meaning  of  the  objective  function  we  are  trying  to  maximize. 
Alternatively,  we  can  start  with  a  directed  graph  and  obtain 
a  directed  induced  subgraph.  Then  we  reinterpret  it  as  an 
undirected  subgraph,  and  apply  the  above  discussion. 

5.  EXPERIMENTAL  EVALUATION 

5.1  Datasets 

We  generated  diffusion  results  using  both  the  AsIC  and 
the  AsLT  models  for  four  large  real  social  networks.  They 
are  all  bidirectionally  connected  networks.  The  first  one  is 
a  trackback  network  of  Japanese  blogs  used  in  [7].  It  has 
12,047  nodes  and  79,920  directed  links  (the  blog  network). 

The  second  one  is  a  coauthorship  network  used  in  [12],  which 
has  12, 357  nodes  and  38,  896  directed  links  (the  coauthor¬ 
ship  network).  The  third  one  is  a  network  derived  from 
the  Enron  Email  Dataset  [9]  by  extracting  the  senders  and 
the  recipients  and  linking  those  that  had  bidirectional  com¬ 
munications.  It  has  4,  254  nodes  and  44,  314  directed  links 
(the  Enron  network).  The  fourth  one  is  a  network  of  people 
that  was  derived  from  the  “list  of  people”  within  Japanese 
Wikipedia,  used  in  [6],  and  has  9,481  nodes  and  245,044 
directed  links  (the  Wikipedia  network). 

5.2  Experimental  Results 

We  visualized  the  information  diffusion  result  in  2-dimensional 
Euclidean  space,  i.e.,  K  =  2,  and  compared  the  results  of  the 


proposed  method  with  the  other  four  existing  methods.  The 
initial  active  node  was  chosen  to  be  the  most  influential  node 
for  each  diffusion  model.  The  location  of  this  node  is  the 
origin  of  the  visualization  plane  for  the  proposed  method, 
but  the  location  of  the  same  node  for  the  other  methods  is 
not  controllable  and  determined  by  the  algorithm  of  each 
method.  The  proposed  method  has  the  time  information. 
A  family  of  blue  dotted  circles  of  different  radii  centered  at 
the  origin  indicates  the  activation  times,  where  the  radius 
t  of  each  blue  dotted  circle  corresponds  to  the  actual  time 
t.  For  all  the  visualization  methods,  red  points  and  green 
lines  are  used  to  display  the  activated  nodes  and  their  links, 
respectively.  It  is  noted  that  we  are  visualizing  from  the  ob¬ 
served  data,  meaning  that  we  don’t  know  the  parent  which 
activated  its  child  if  there  is  more  than  one  active  parent. 
Thus,  all  the  links  between  the  activate  parents  and  their 
active  children  are  displayed. 

Due  to  the  space  limitation,  we  only  show  parts  of  the  re¬ 
sults.  Figure  1  shows  the  visualization  result  of  information 
diffusion  for  the  AsIC  model  over  the  Blog  network  using 
the  proposed  method,  where  the  thick  black  circle  indicate 
the  initial  active  node.  It  is  clear  that  the  proposed  method 
have  the  following  properties: 

1.  Given  two  active  nodes,  we  can  easily  see  which  one 
became  active  earlier. 

2.  Given  an  active  node,  we  can  easily  identify  its  parents 
that  could  activate  it  (but  we  cannot  identify  it  if  there 
are  multiple  active  parents  by  the  reason  mentioned 
above). 
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Figure  1:  Visualization  result  of  proposed  method  for  the 
Blog  network  (AsIC  model). 

We  can  observe  that  in  general  super-mediators ,  i.e.,  those 
nodes  that  play  an  important  role  in  passing  the  informa¬ 
tion  to  other  nodes,  are  easily  identified  by  the  proposed 
method.  In  Figure  1,  the  thick  black  diamond  node  can  nat¬ 
urally  be  interpreted  as  a  super-mediator.  The  same  node 
is  also  displayed  as  thick  black  diamonds  in  Figures  2,  3,  4 
and  5.  We  notice  that  the  multi-dimensional  scaling  and  the 
spring  force  embedding  methods  are  also  good  to  find  super¬ 
mediators,  while  it  is  more  difficult  to  find  them  for  the  spec¬ 
tral  embedding  and  the  cross-entropy  embedding  methods. 
Note  that  the  multi-dimensional  scaling  and  the  spring  force 
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Figure  2:  Visualization  result  of  multi-dimensional  scaling 
for  the  Blog  network  (AsIC  model). 
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Figure  3:  Visualization  result  of  spectral  embedding  for  the 
Blog  network  (AsIC  model). 


embedding  methods  are  based  on  graph  distance  matrix  G, 
and  the  spectral  embedding  and  the  cross-entropy  embed¬ 
ding  methods  are  based  on  graph  adjacency  matrix  A.  For 
the  G-based  methods,  the  distance  from  the  initial  active 
node  (thick  black  circle)  to  an  active  node  v  in  the  visual¬ 
ization  plane  can  be  correlated  with  the  time  if  the  node  v 
is  an  active  node.  Thus,  we  can  consider  that  such  meth¬ 
ods  have  a  possibility  of  finding  super-mediators.  However, 
we  see  from  Figures  1  to  5  that  the  proposed  method  bet¬ 
ter  identifies  a  super-mediator  than  the  multi-dimensional 
scaling  and  the  spring  force  embedding  methods. 

Figure  6  shows  the  visualization  result  of  information  dif¬ 
fusion  for  the  AsLT  model  over  the  Blog  network.  Compared 
with  the  visualization  result  for  the  AsIC  model,  we  observe 
that  links  are  mostly  outward  directed  and  only  small  links 
are  in  circumferential  direction.  We  consider  that  this  ob¬ 
servation  comes  from  a  characteristic  difference  between  the 
AsIC  and  AsLT  models.  Especially,  in  case  of  the  AsLT 
model,  when  a  parent  node  becomes  active,  only  its  low 
degree  child  nodes  are  likely  to  be  activated.  Our  proposed 
method  will  locate  these  child  nodes  to  similar  directions  be¬ 
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Figure  4:  Visualization  result  of  spring  force  embedding  for 
the  Blog  network  (AsIC  model). 
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Figure  5:  Visualization  result  of  cross-entropy  embedding 
for  the  Blog  network  (AsIC  model). 


cause  their  connectivity  patterns  are  necessarily  close.  We 
consider  that  this  fact  partly  explains  the  difference  between 
the  visualization  results  of  Figure  1  and  6. 

Figures  7  and  8  respectively  show  the  visualization  results 
of  information  diffusion  for  the  AsIC  and  AsLT  models  over 
the  Wikipedia  network.  We  can  also  see  from  these  figures 
that  the  proposed  method  is  promising  for  identifying  influ¬ 
ential  super-mediators  and  exploring  the  characteristic  dif¬ 
ferences  between  the  two  information  diffusion  models.  As 
mentioned  earlier,  the  visualization  results  over  the  coau¬ 
thorship  and  Enron  networks  are  omitted  due  to  the  space 
limitation,  but  it  is  confirmed  that  we  obtained  similar  re¬ 
sults. 

Last  but  not  least,  we  evaluated  our  proposed  method  only 
in  the  case  of  two-dimensional  embedding  for  our  visualiza¬ 
tion  purpose,  but  this  does  not  mean  that  it  is  limited  to 
two-dimensional  embedding.  It  is  quite  easy  to  extend  it  to 
the  general  A-dimension  embedding.  We  plan  to  evaluate 
our  method  as  a  powerful  technique  for  both  dimensional 
reduction  and  clustering  as  a  future  work. 


Figure  6:  Visualization  result  of  proposed  method  for  the 
Blog  network  (AsLT  model). 


Figure  8:  Visualization  result  of  proposed  method  for  the 
Wikipedia  network  (AsLT  model). 
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Figure  7:  Visualization  result  of  proposed  method  for  the 
Wikipedia  network  (AsIC  model). 

6.  DISCUSSION 

One  of  the  unique  features  of  the  proposed  method  is  that 
we  deal  with  the  graph  that  has  a  value  to  each  node,  and 
the  visualization  takes  account  of  the  node  value.  The  appli¬ 
cation  to  information  diffusion  involves  time  evolution  and 
assigning  the  time  the  node  gets  activated  to  the  node  value 
works  nicely  to  allow  the  diffusion  starts  at  the  origin  always. 
On  the  contrary,  all  the  other  existing  methods,  when  ap¬ 
plied  to  the  same  visualization  problem,  generates  a  graph 
where  the  starting  point  of  the  diffusion  is  determined  by 
the  algorithm.  Thus,  if  we  want  to  visualize  multiple  results 
of  diffusion  sequences  each  starting  from  the  same  node,  the 
starting  node  in  each  visualization  is  placed  in  a  different 
location.  Thus,  the  above  feature  is  one  of  the  advantages 
of  the  proposed  method. 

7.  CONCLUSION 

We  addressed  the  problem  of  visualizing  structure  of  a 
undirected  graph  that  has  a  value  associated  with  each  node 
into  a  K-dimensional  Euclidean  space  in  such  a  way  that  1) 


the  length  of  the  point  vector  in  this  space  is  equal  to  the 
value  assigned  to  the  node  and  2)  nodes  that  are  connected 
are  placed  as  close  as  possible  to  each  other  in  the  space 
and  nodes  not  connected  are  placed  as  far  apart  as  possible 
from  each  other.  We  showed  that  this  visualization  prob¬ 
lem  is  reduced  to  spherical  embedding  that  is  formulated  as 
a  non-linear  optimization  problem  for  which  a  certain  ob¬ 
jective  function  to  be  maximized  is  defined.  We  proposed 
a  very  efficient  algorithm  based  on  a  power  iteration  that 
employs  double-centering  operations.  To  validate  the  effec¬ 
tiveness  of  the  proposed  method,  we  applied  it  to  visualize 
the  information  diffusion  process  over  a  social  network  by  as¬ 
signing  the  node  activation  time  to  the  node  value.  We  used 
the  result  of  information  diffusion  obtained  by  two  differ¬ 
ent  diffusion  models  (AsIC  and  AsLT  models)  for  four  real 
world  networks,  and  compared  the  proposed  method  with 
the  multi-dimensional  scaling,  the  spring  force  embedding, 
the  spectral  embedding  and  the  cross-entropy  embedding 
methods.  We  first  confirmed  that  the  proposed  method  can 
visualize  time  evolution  of  the  diffusion  process  in  an  more 
intuitively  understandable  manner.  We  also  confirmed  that 
the  proposed  method  have  the  following  properties:  1)  given 
two  active  nodes,  we  can  easily  see  which  one  became  active 
earlier,  and  2)  given  an  active  node,  we  can  easily  identify 
its  parents  that  could  activate  it  (note  that  we  are  visualiz¬ 
ing  from  the  observed  diffusion  data,  meaning  that  we  don’t 
know  the  parent  which  activated  its  child  if  there  is  more 
than  one  active  parent.)  Furthermore,  we  experimentally 
showed  that  the  proposed  method  can  better  identify  super¬ 
mediators,  i.e. ,  those  nodes  that  play  an  important  role  in 
passing  the  information  to  other  nodes,  than  the  other  four 
methods. 
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Abstract.  How  the  information  diffuses  over  a  large  social  network  depends  on 
both  the  model  employed  to  simulate  the  diffusion  and  the  network  structure  over 
which  the  information  diffuses.  We  analyzed  both  theoretically  and  empirically 
how  the  two  contrasting  most  fundamental  diffusion  models.  Independent  Cas¬ 
cade  (IC)  and  Linear  Threshold  (LT)  behave  differently  or  similarly  over  dif¬ 
ferent  network  structures.  We  devised  two  rewiring  structures,  one  preserving 
in/out-degree  correlation  and  the  other  changing  in/out-degree  correlation  while 
both  preserving  their  in/out-degree  distributions,  and  analyzed  how  co-link  rate 
and  in/out-degree  correlation  affect  the  influence  degree  of  each  diffusion  model 
using  two  real  world  networks,  each  as  the  base  network  on  which  rewiring  is 
imposed.  The  results  of  the  theoretical  analysis  qualitatively  explain  the  empiri¬ 
cal  results,  and  the  findings  help  deepen  the  understanding  of  complex  diffusion 
phenomena. 

Keywords:  Information  diffusion,  network  structure,  influence  degree,  node  de¬ 
gree  distribution 


1  Introduction 

The  emergence  of  Social  Media  such  as  Facebook,  Digg  and  Twitter  has  provided  us 
with  the  opportunity  to  create  large  social  networks,  which  are  becoming  an  important 
medium  for  spreading  information.  Recently,  substantial  attention  has  been  devoted  to 
analyzing  and  mining  social  networks  from  the  point  of  information  diffusion  [14, 15, 
11, 19,2, 1, 16].  One  of  the  most  well  studied  problems  is  the  influence  maximization 
problem,  i.e.,  the  problem  of  finding  a  limited  number  of  influential  nodes  that  are 
effective  for  the  spread  of  information.  Many  algorithms  have  been  proposed  to  solve 
the  problem  using  probabilistic  information  diffusion  models  on  a  network  [8, 12,5, 
9,6,4].  In  order  to  investigate  diffusion  phenomena  using  probabilistic  models,  it  is 
indispensable  to  understand  the  behavioral  differences  among  models,  and  provide  an 
effective  method  for  selecting  the  most  appropriate  model  for  a  particular  task  we  want 
to  analyze. 
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There  are  two  contrasting  fundamental  probabilistic  models  that  have  been  widely 
used  by  many  researchers.  One  is  the  independent  cascade  (IC)  model  [7,8]  and  the 
other  is  the  linear  threshold  (LT)  model  [18, 8],  The  IC  model  takes  a  sender-centered 
approach  such  that  each  information  sender  independently  influences  its  neighbors  with 
some  probability  ( information  push  style  model).  The  LT  model  is  a  receiver-centered 
approach  such  that  each  information  receiver  adopts  the  information  if  and  only  if  the 
number  of  its  neighbors  that  have  adopted  the  information  exceeds  some  threshold, 
where  the  threshold  is  treated  as  a  random  variable  ( information  pull  style  model).  We 
analyze  how  the  IC  and  the  LT  models  differ  from  or  similar  to  each  other  in  terms  of 
information  diffusion  for  a  wide  range  of  social  networks  with  different  structures. 

In  this  paper,  we  compare  influence  degree  obtained  by  the  IC  and  the  LT  models 
from  the  network  structure  perspective.  Here,  the  influence  degree  of  a  node  v  under  a 
probabilistic  diffusion  model  in  a  network  is  defined  to  be  the  expected  number  of  active 
nodes  at  the  end  of  the  information  diffusion  process  that  starts  from  the  initial  active 
node  v,  where  nodes  that  have  been  influenced  with  the  information  are  referred  to  as 
being  active.  First,  we  theoretically  analyze  the  properties  of  the  IC  and  the  LT  models 
on  scale-free  networks,  and  derive  the  following  two  properties:  1)  as  the  in/out-degree 
correlation  decreases,  the  influence  degree  decreases  for  the  IC  model  but  it  does  not 
change  for  the  LT  model  and  2)  as  the  co-link  (bidirectional  link)  rate  decreases,  the 
influence  degree  increases  for  both  the  IC  and  the  LT  models,  but  the  IC  model  is 
much  less  sensitive  than  the  LT  model.  To  verify  these  properties,  we  systematically 
generated  a  series  of  scale-free  networks  with  varying  in/out-degree  correlation  and  co¬ 
link  rate,  applying  two  rewiring  strategies,  one  preserving  in/out-degree  correlation  and 
the  other  changing  in/out-degree  correlation  while  both  preserving  their  in/out-degree 
distributions.  We  used  two  real  world  scale  free  networks  as  the  bases  to  apply  these 
strategies,  and  experimentally  confirmed  that  the  above  two  properties  indeed  hold. 


2  Diffusion  Models 

Let  G  =  ( V. ,  E)  be  a  directed  network,  where  V  and  E  (c  V  x  V)  are  the  sets  of  all  the 
nodes  and  links,  respectively,  and  |V|  <  \E\  can  be  naturally  assumed  for  commonly- 
seen  social  networks.  We  recall  the  definition  of  the  IC  and  the  LT  models  according 
to  the  literatures  [8,9].  In  these  models,  the  diffusion  process  proceeds  from  an  initial 
active  node  in  discrete  time-step  t  >  0,  and  it  is  assumed  that  nodes  can  switch  their 
states  only  from  inactive  to  active  (i.e..  the  SIR  setting). 

The  IC  model  has  a  diffusion  probability  pu  v  with  0  <  pu  v  <  1  for  each  link  (m,  v)  as 
a  parameter.  Suppose  that  a  node  u  first  becomes  active  at  time-step  t,  it  is  given  a  single 
chance  to  activate  each  currently  inactive  child  node  v,  and  succeeds  with  probability 
puv.  If  u  succeeds,  then  v  will  become  active  at  time-step  t  +  1.  If  multiple  parent  nodes 
of  v  first  become  active  at  time-step  f,  then  their  activation  trials  are  sequenced  in  an 
arbitrary  order,  but  all  performed  at  time-step  t.  Whether  u  succeeds  or  not,  it  cannot 
make  any  further  trials  to  activate  v;  in  subsequent  rounds.  The  process  terminates  if  no 
more  activations  are  possible. 

The  LT  model  has  a  weight  qUtV  (>  0)  with  EueB(v)  <lu,v  ^  1  for  each  link  (m,  v)  as  a 
parameter,  where  B(v)  =  {u  e  V;  (u,v)  e  E)  is  the  set  of  parent  nodes  of  node  v.  First, 


Effect  of  In/Out-Degree  Correlation  on  Influence  Degree 


3 


for  any  node  v  e  V,  a  threshold  6V  is  chosen  uniformly  at  random  from  the  interval  [0,1]. 
An  inactive  node  v  is  influenced  by  its  active  parent  nodes.  If  the  total  weight  from  v’s 
active  parent  nodes  at  time-step  t  is  no  less  than  6V ,  i.e.,  2b€B,(v)?«,v  ^  0V>  then  v  will 
get  activated  at  time-step  t  +  1.  Here,  B,(v)  is  the  set  of  all  the  parent  nodes  of  v  that  are 
active  at  time-step  t.  The  process  terminates  if  no  more  activations  are  possible. 


3  Analysis  of  Local  Influence  Degree 


We  first  define  local  influence  degree  of  node  u,  denoted  by  crL(u),  as  the  expected 
number  of  u's  child  nodes  directly  activated  by  u.  For  the  IC  model,  <t£c(k)  is  given  by 
crI7{u)  =  pu,v  where  F(u)  stands  for  the  set  of  u’s  child  nodes  defined  by  F(u)  — 

{v  e  V;  (m,  v;)  e  E}.  For  the  LT  model  cr7T  (u)  is  given  by  cr7T(u)  =  XveF(u)  du,v  because 
each  weight  qluv  is  regarded  to  be  the  probability  that  the  threshold  6V  is  chosen  from 
the  interval  [0,  quv] .  Then,  we  can  calculate  the  average  local  influence  degree  over  all 
nodes,  denoted  by  crj,(G).  For  the  LT  model,  if  we  impose  the  condition  Xucbiv)  <?«,v  =  1 
for  any  node  v  e  V,  we  can  prove  &IjT (G)  =  1  from  the  following  relations. 


(G)  -  |V|  Z  °^(m)  ”  |V|  Z  Z  q,,-v  -  |V|  Z  q“’v  -  I V|  Z  Z 

I  I  ,«y  I  1  u£v  veF(u)  '  '  1  1  ^ 


veV  ueB(v) 


For  the  IC  model,  if  we  impose  the  uniform  diffusion  probability  setting,  i.e.,  pu  v  =  p 
for  any  link  (u,  v)  e  E,  which  has  been  employed  in  many  previous  studies  (e.g.,  [8]), 
we  can  calculate  &I7(G)  as  follows: 


Z  °^C(m) 

ueV 


1 

W\ 


ZZ 

ueV  veF(u ) 


z 

(u,v)eE 


p  = 


\E\ 


where  {||  is  equal  to  the  average  degree  d  -  ^  \B(u)\  =  ^  Xuev  \F(u)\  -  jf},  and  is 

no  less  than  1  as  we  assume  |  V|  <  \E\.  Thus,  by  setting  the  uniform  diffusion  probability 
to  the  inverse  of  average  degree,  i.e.,  p  =  ^  =  jZj,  we  obtain  cr^c(G)  =  1.  This  makes  the 
IC  and  LT  models  equivalent  in  terms  of  the  average  local  influence  degree.  Hereafter, 
we  impose  these  settings  to  evaluate  the  influence  degree.  Note  that  local  influence 
degree  of  node  u  for  the  IC  model  becomes  <x[c(m)  =  EveFM  pu,v  =  p\F(u)\. 

So  far  we  focused  on  local  influence  degree  of  node  it  e  V  under  the  condition  that 
the  node  u  has  become  active.  However,  when  considering  the  cascade  of  information 
diffusion,  we  need  to  consider  the  probability  r{u)  that  the  node  u  is  activated  by  its  par¬ 
ent  nodes.  Namely,  we  consider  cascading  local  influence  degree  defined  by  ctcl(m)  = 
r(u)crL(u).  As  the  simplest  case,  we  employ  the  probability  r(u)  that  the  node  u  is  acti¬ 
vated  at  the  next  time  step  by  some  active  node  selected  uniformly  at  random  from  the 
node  set  V.  For  the  IC  model,  r,c(u)  is  given  by  r,c(u)  =  ^  2.sefi(«)  Ps.u  =  and 

for  the  LT  model,  rLT(u)  is  given  by  rLT (u )  =  ^  Yjsebiu)  ds.u  =  jpp  Thus  we  obtain  the 
average  cascading  local  influence  degree  &cl  for  the  IC  and  LT  models  as  follows: 

&cl(G)  =  ^  Z  =  Li  Z  \BMWF(u)\,  (1) 

1  1  ueV  11  ueV 

=  j^f  Z  =  j^jl  Z  <(«)  =  j^f- 

1  1  ueV  1  1  ueV  11 


(2) 
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Therefore,  by  noting  that  the  in/out-degree  correlation  dci/o(G )  is  quantified  by 

4j  Zi,ev\B(uW(u)\  ~  d2 

dcI/0(G)  =  11  _ 

ZueV  \B(u)\2  -  d 2  ^  Zuev  TOP  -  d 2 

and  the  denominator  of  dcj/o(G)  is  determined  by  the  standard  deviations  of  in/out- 
degree  distributions,  we  can  see  that  the  average  cascading  local  influence  degree  of 
the  IC  model  is  affected  by  the  in/out-degree  correlation  dci/o(G)  when  the  standard 
deviations  are  fixed,  as  shown  in  Eq.  (1),  while  that  of  the  LT  model  is  not  affected,  as 
shown  in  Eq.  (2).  Namely,  we  can  conjecture  that  influence  degree  of  the  IC  model  also 
decreases  when  the  in/out-degree  correlation  decreases. 

Another  important  factor  affecting  influence  degree  is  the  co-link  rate  cr(G )  which 
is  defined  by  cr(G )  =  ^  Zuev  \B(u)C\F(u)\.  Evidently,  for  a  bidirectional  network  G,  we 
obtain  cr(G )  =  1  because  B(u)  =  F(u)  for  any  u  6  V.  Assume  a  node  v  6  B(u)  n  F{u)\ 
if  v  succeeds  activating  m,  then  the  reverse  link  (n,  v)  never  contributes  to  increasing 
an  active  node,  conversely,  if  u  succeeds  activating  v,  then  the  reverse  link  (v,  u)  never 
does  so.  Thus,  we  conjecture  that  influence  degree  of  the  IC  and  LT  model  increases 
when  the  co-link  rate  cr(G)  decreases.  However,  there  is  a  subtle  difference  between 
the  IC  and  the  LT  models.  Think  of  the  network  with  co-link  rate  close  to  1 .  Evidently 
the  in/out-degree  correlation  is  also  close  to  1 .  Assume  that  k  parents  of  a  node  v  which 
has  a  large  degree  D  =  |F(v)|  =  \B(v)\  get  activated.  The  expected  probability  that  the 
node  v  becomes  activated  is  1  -  (1  -  l/d)k  for  the  IC  model  and  k/D  for  the  LT  model 
where  d  is  the  average  node  degree.  For  the  IC  model  the  probability  is  large  for  a  small 
number  of  k  and  insensitive  to  |D|.  Thus,  once  it  gets  activated,  the  reverse  k  links  which 
do  not  contribute  further  activation  is  small.  On  the  other  hand,  for  the  LT  model  the 
node  v;  is  not  activated  unless  k  is  large.  Thus,  once  it  gets  activated,  the  reverse  k  links 
do  not  contribute  further  activation  is  also  large.  This  implies  that  the  IC  model  is  less 
sensitive  to  the  change  of  co-link  rate  than  the  LT  model. 

4  Experiments 

To  confirm  our  conjectures  in  Section  3,  we  conducted  extensive  experiments  using 
both  synthetic  and  real  world  large  networks,  rewiring  their  links  according  to  the  two 
strategies  presented  in  this  section.  However,  due  to  the  page  limitation,  we  show  only 
the  results  for  the  two  real  world  networks:  one  bidirectional  and  the  other  directional1 . 

4.1  Rewiring  Strategies 

We  devised  two  rewiring  strategies.  Both  preserve  the  in/out-degree  distribution.  The 
first  one  rewires  links  of  a  given  network  G  preserving  the  in/out-degrees  of  each  node, 
which  is  equivalent  to  the  method  of  generating  randomized  networks  presented  in  [13]. 
We  implemented  this  strategy  by  swapping  the  two  destination  nodes  v  and  v'  of  links 

1  The  networks  we  omitted  here  include  synthetic  networks  generated  by  the  BA  model  [3]  and 
the  CNN  model  [17],  and  four  other  networks  derived  from  the  real  world  data. 
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e  =  (n,  v)  and  e'  =  (m',v')  from  two  starting  nodes  u  and  u'.  The  links  are  chosen 
uniformly  at  random.  Obviously,  this  never  changes  dcpoiG),  but  does  change  cr(G). 
We  refer  to  this  rewiring  strategy  as  the  DCP  (in/out-Degree  Correlation  Preserved) 
method,  and  denote  the  network  G  rewired  by  this  method  by  dcpa{G),  where  a  is  the 
link  rewiring  probability,  i.e.,  v  of  e  and  v'  of  e'  are  swapped  with  the  probability  a.  The 
larger  a  is,  the  smaller  cr(G)  is.  Thus,  the  DCP  method  allows  us  to  investigate  how  the 
co-link  rate  affects  the  influence  degree  of  the  IC  and  the  LT  models.  The  second  one 
rewires  links  changing  the  in/out-degree  correlation.  This  is  to  confirm  our  conjecture 
that  the  in/out-degree  correlation  affects  the  influence  degrees  of  the  IC  model.  We 
implemented  this  by  swapping  £7(v),  all  the  incoming  links  to  a  node  v,  and  Ei(v'), 
all  the  incoming  links  to  a  node  V  with  a  probability  a.  Nodes  v  and  V  are  randomly 
chosen.  Namely,  £7(v)  becomes  {( u ,  v);  u  €  B(v')},  and  E;(v')  becomes  {(s,  v');  s  e  B(v)} 
after  swapping.  This  method  changes  the  in-degree  of  chosen  nodes  without  changing 
their  out-degree  while  preserving  the  in/out-degree  distributions  of  the  network  G.  We 
refer  to  this  method  as  the  DCU  (in/out-Degree  Correlation  Unpreserved)  method,  and 
denote  the  network  G  rewired  by  the  DCU  method  with  a  link  rewiring  probability  a 
by  dcua(G).  The  larger  a  is,  the  smaller  the  in/out-degree  correlation  is. 


4.2  Datasets  and  Network  Structure 


In  this  section,  we  explain  the  two  real  world  networks  for  which  we  present  the  exper¬ 
imental  results.  The  first  one  is  a  bidirectional  network  derived  from  the  Enron  Email 
Dataset  [10].  We  regarded  each  email  address  as  a  node,  and  constructed  a  bidirectional 
link  between  two  email  addresses  u  and  v  only  if  u  sent  an  email  to  v  and  received  an 
email  from  v.  After  that,  we  extracted  the  maximal  strongly  connected  component.  We 
refer  to  this  bidirectional  network  as  the  Enron  network,  which  has  4, 254  nodes  and 
44, 314  directed  links.  The  second  one  is  a  directional  network  derived  from  a  Japanese 
word-of-mouth  communication  site  for  cosmetics,  “@cosme”2,  where  each  user  page  is 
associated  with  fan  links.  A  fan  link  from  user  u  to  user  v  is  generated  if  user  v  registers 
user  u  as  his/her  favorite  user.  We  extracted  a  fan  network  from  @cosme  by  tracing  up 
to  ten  steps  in  the  fan  links  starting  from  a  randomly  chosen  user  in  December  2009. 
The  resulting  network  has  45,024  nodes  and  351,299  directed  links.  We  refer  to  this 
network  as  the  Cosme  network. 

For  these  networks,  we  investigated  the  influence  degree  <x(v)  of  each  node  v  of  the 
networks  dcpa(G)  and  dcua(G)  under  the  IC  and  the  LT  models,  varying  a  from  0.0 
to  1.0  by  0.1.  Note  that  dcpo.o(G)  =  dcuox)(G)  =  G.  The  influence  degree  cr(v)  was 
estimated  by  the  empirical  mean  of  the  number  of  active  nodes  obtained  from  10,000 
independent  runs  of  information  diffusion  based  on  the  bond  percolation  technique  [9], 
According  to  the  discussion  in  Section  3,  we  set  a  unique  value  p  =  l/d  to  every  puv 
for  the  IC  model.  Namely,  p  was  set  to  0.10  for  the  Enron  network,  and  0.13  for  the 
Cosme  network. 


2  http://www.cosme.net/ 
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(a)  The  change  of  In/Out- 
degree  correlation  and  Co-link 
rate. 


(b)  The  best  and  the  average  in¬ 
fluence  degrees  with  the  DCP 
method. 


(c)  The  best  and  the  average  in¬ 
fluence  degrees  with  the  DCU 
method. 


Fig.  1:  Experimental  results  for  the  Enron  network. 


Rewiring  probability  a 


(a)  The  change  of  In/Out- 
degree  correlation  and  Co-link 
rate. 


(b)  The  best  and  the  average  in¬ 
fluence  degrees  with  the  DCP 
method. 


(c)  The  best  and  the  average  in¬ 
fluence  degrees  with  the  DCU 
method. 


Fig.  2:  Experimental  results  for  the  Cosme  network. 


4.3  Experimental  Results 

Figures  la  and  2a  show  how  the  in/out-degree  correlation  dci/oiG)  and  the  co-link  rate 
cr(G)  of  a  given  network  G  change  with  the  two  rewiring  methods,  DCP  and  DCU,  for 
the  Enron  and  the  Cosme  networks,  respectively.  We  see  that  both  methods  work  just 
as  we  intended:  cr(G )  decreases  in  a  similar  fashion  for  both  the  DCP  and  the  DCU 
methods,  as  the  rewiring  probability  a  becomes  larger,  while  dcj/o(G)  does  not  change 
with  the  DCP  method,  but  it  does  decrease  similarly  to  cr(G)  with  the  DCU  method. 
Note  that  both  dci/o(G )  and  cr(G )  of  the  Enron  network  are  1.0  for  a  =  0.0  because  it 
is  bidirectional. 

Figure  lb  illustrates  how  the  DCP  method  affects  the  best  and  the  average  influence 
degrees  over  all  the  nodes  of  the  Enron  network.  As  we  expected,  both  influence  degrees 
of  the  LT  model  become  larger  as  the  rewiring  probability  becomes  larger,  and  the  co¬ 
link  rate  becomes  smaller.  The  influence  degrees  of  the  IC  model  does  not  seem  to 
increase,  but  indeed  they  slightly  increase  within  the  range  of  a  —  0.0  to  0.6  where  the 
co-link  rate  drastically  decreased.  This  qualitatively  supports  the  analysis  in  Section  3. 
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The  same  tendencies  can  be  found  in  the  result  for  the  Cosme  network  as  shown  in 
Fig.  2b.  We  also  observed  the  same  tendencies  for  the  other  networks  we  omitted  here. 

Figures,  lc  and  2c  show  how  the  DCU  method  affects  the  best  and  the  average 
influence  degrees  of  the  IC  and  the  LT  models.  Both  dcj/o(G )  and  cr(G)  decrease  with 
a.  This  imposes  two  conflicting  factors  for  the  IC  model,  but  the  effect  of  dci/o(G) 
surpasses  and  the  influence  degrees  of  the  IC  model  decrease  for  both  the  Enron  and 
the  Cosme  networks.  On  the  other  hand,  the  influence  degrees  of  the  LT  model  are 
affected  by  only  cr(G).  Thus,  they  increase  in  the  same  way  as  in  Figs,  lb  and  2b.  The 
same  observation  is  obtained  for  the  other  networks.  This  also  qualitatively  supports  the 
analysis  in  Section  3. 


5  Conclusion 


Understanding  how  information  diffuses  over  a  large  social  network  is  important  to  do 
any  kind  of  social  network  analysis,  but  it  is  difficult  because  actual  diffusion  depends 
on  both  the  diffusion  model  employed  and  the  properties  of  the  network  structure  over 
which  the  information  diffuses.  Independent  Cascade  (IC)  and  Linear  Threshold  (LT) 
models  have  been  used  widely  by  many  researchers.  Both  are  probabilistic  models  but 
have  contrasting  properties,  i. e. ,  information  push  (IC)  and  information  pull  (LT).  So¬ 
cial  networks  have  common  characteristics.  The  most  important  one  would  be  the  scale 
free  property.  There  can  be  many  structures  that  hold  this  property.  We  devised  two 
rewiring  strategies  that  can  systematically  transform  one  network  structure  to  another 
structure  preserving  the  scale  free  property,  one  preserving  in/out-degree  correlation 
(DCP  method)  and  the  other  changing  in/out-degree  correlation  (DCU  method).  Each 
strategy  was  successively  applied  with  different  probabilities  to  two  real  world  social 
networks,  generating  a  series  of  networks,  each  with  a  gradually  changing  structure.  We 
chose  co-link  rate  and  in/out-degree  correlation  as  the  two  parameters  that  characterize 
the  network  structure,  and  investigated  how  these  parameters  affects  the  influence  de¬ 
gree  of  the  two  models  (IC  and  LT).  The  major  new  findings  are  1)  the  IC  model  is 
sensitive  to  in/out-degree  correlation  and  the  influence  degree  is  positively  correlated 
to  it,  whereas  the  LT  model  is  insensitive  to  it  and  2)  Both  the  IC  and  the  LT  models 
are  negatively  correlated  to  co-link  rate,  but  its  dependency  is  much  less  sensitive  in  the 
IC  model.  These  properties  can  be  qualitatively  derived  by  the  theoretical  analysis  and 
verified  by  the  extensive  experiments  using  the  above  networks  as  well  as  others  not 
reported  in  this  paper.  These  findings  are  useful  in  deepening  our  understanding  of  the 
complex  information  diffusion  phenomena  over  a  social  network. 
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Abstract.  We  address  the  problem  of  visualizing  structure  of  bipartite  graphs 
such  as  relations  between  pairs  of  objects  and  their  multi-labeled  categories.  For 
this  task,  the  existing  spherical  embedding  method,  as  well  as  the  other  standard 
graph  embedding  methods,  can  be  used.  However,  these  existing  methods  either 
produce  poor  visualization  results  or  require  extremely  large  computation  time  to 
obtain  the  final  results.  In  order  to  overcome  these  shortcomings,  we  propose  a 
new  spherical  embedding  method  based  on  a  power  iteration,  which  additionally 
performs  two  operations  on  the  position  vectors:  double-centering  and  normal¬ 
izing  operations.  Moreover,  we  theoretically  prove  that  the  proposed  method  al¬ 
ways  converges.  In  our  experiments  using  bipartite  graphs  constructed  from  the 
Japanese  sites  of  YahoolMovies  and  Yahoo! Answers,  we  show  that  the  proposed 
method  works  much  faster  than  these  existing  methods  and  still  the  visualization 
results  are  comparable  to  the  best  available  so  far. 


1  Introduction 

Visualization  by  embedding  graphs  into  a  low  dimensional  Euclidean  space  plays  an 
important  role  to  intuitively  understand  the  essential  structure  of  graphs  (networks).  To 
this  end,  various  graph  embedding  methods  have  been  proposed  in  the  past  that  include 
multi-dimensional  scaling  [5],  spectral  embedding  [1],  spring  force  embedding  [2], 
cross-entropy  embedding  [7].  Each  method  has  its  own  advantages  and  disadvantages. 

In  this  paper,  we  address  the  problem  of  visualizing  structure  of  bipartite  graphs 
such  as  relations  between  pairs  of  objects  and  their  multi-labeled  categories.  More 
specifically,  relations  of  this  kind  include  pairs  of  movies  and  their  associated  genres. 
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pairs  of  persons  and  their  interested  genres,  pairs  of  researchers  and  their  coauthoring 
papers,  pairs  of  words  and  their  appearing  documents,  and  many  more.  Clearly,  we  can 
straightforwardly  apply  any  one  of  the  above-mentioned  embedding  methods  for  the 
visualization.  However,  we  note  that  these  standard  methods  have  an  intrinsic  limita¬ 
tion  because  they  cannot  make  much  use  of  the  essential  structure  of  bipartite  graphs. 
Indeed,  the  existing  spherical  embedding  method  has  been  proposed  for  the  purpose  of 
visualizing  bipartite  graphs  [6],  In  this  method,  the  position  vectors  are  embedded  on 
two  concentric  spheres  (circles)  with  different  radii.  We  consider  that  such  a  spherical 
embedding  can  be  a  natural  representation  for  bipartite  graphs.  However,  the  biggest 
problem  with  the  existing  method  is  that  it  often  requires  an  extremely  large  computa¬ 
tion  time  to  obtain  the  final  visualization  results. 

In  this  paper,  to  overcome  these  shortcomings,  we  propose  a  new  spherical  em¬ 
bedding  method  based  on  a  power  iteration,  which  adopts  two  operations  to  iteratively 
adjust  the  positioning  vectors:  double-centering  and  normalizing  operations.  We  further 
show  theoretically  that  the  convergence  of  the  proposed  algorithm  is  always  guaranteed. 
In  our  experiments  that  use  bipartite  graphs  constructed  from  the  Japanese  sites  of  Ya- 
hoolMovies  and  Yahoo! Answers,  we  show  that  the  proposed  method  works  much  faster 
than  these  existing  methods,  and  yet  the  visualization  results  are  comparable  to  the  best 
available  so  far. 

This  paper  is  organized  as  follows.  We  first  describe  the  problem  framework  of 
embedding  bipartite  graphs  into  a  low  dimensional  Euclidean  space  in  Section  2.  Next 
we  describe  our  proposed  algorithm,  and  prove  that  this  algorithm  always  converges  in 
Section  3.  Then  we  experimentally  evaluate  the  proposed  method  by  comparing  it  with 
the  existing  embedding  methods  in  terms  of  both  the  efficiency  of  the  algorithms  and 
ease  of  the  interpretability  of  visualization  results  in  Section  4.  We  last  summarize  the 
main  conclusion  in  Section  5. 


2  Problem  Framework 

We  describe  the  problem  framework  of  embedding  the  bipartite  graph  G  =  (V,  E)  into  a 
X-dimensional  Euclidean  space,  where  V  =  Va  U  Vb,  Va  n  Vb  -  0,  and  E  c  Va  X  Vb- 
For  the  sake  of  technical  convenience,  we  identify  each  set  of  the  nodes,  V a  and  Vb, 
by  two  different  series  of  positive  integers,  i.e.,  Va  =  {1, - •  -  , m,  ■■■  ,M )  and  Vb  = 
{1,  -  -  •  ,«,•••  ,  N}.  Here  M  and  N  are  the  numbers  of  the  nodes  in  Va  and  Vb  ,  i.e., 
\Va\  -  M  and  \Vb\  =  N,  respectively.  Then,  we  can  define  the  M  x  N  adjacency  matrix 
A  =  {amn}  by  setting  amn  —  1  if  ( m,n )  e  E;  a„h„  =  0  otherwise.  We  denote  the  K- 
dimensional  embedding  position  vectors  by  x„,  for  the  node  m  e  V a  and  y„  for  the 
node  n  e  Vb-  Then  we  can  construct  M  x  K  and  N  x  K  matrices  consisting  of  these 
position  vectors,  i.e.,  X  =  (xi,  •  •  •  Xm)t  and  Y  =  (jq,  •  •  •  y n)t .  Here  Xr  stands  for  the 
transposition  of  X. 

According  to  the  work  on  the  existing  spherical  embedding  method  [6],  we  explain 
the  framework  of  spherical  embedding  of  bipartite  graph.  In  Fig.  1,  we  show  an  example 
in  a  two-dimensional  Euclidean  space,  i.e.,  unlike  the  standard  visualization  scheme 
shown  in  Fig.  la,  we  consider  locating  the  position  vectors  on  two  concentric  spheres 
(circles)  as  shown  in  Fig.  lb.  We  believe  that  this  kind  of  spherical  embedding  is  natural 
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Va  vb 

(a)  Bipartite  Graph 


(b)  Spherical  Embedding 


Fig.  1 :  Spherical  Embedding  for  Bipartite  Graph 


to  represent  bipartite  graphs,  and  its  usefulness  has  been  reported  [6],  Hereafter,  we 
assume  that  nodes  in  subset  V\  are  located  on  the  inner  circle  0A  with  radius  rA  -  1 , 
while  nodes  in  Vb  are  located  on  the  outer  circle  8g  with  radius  rH  =  2.  Note  that 
||x„,||  =  1,  ||y„ |[  =  2.  Then,  our  aim  is  to  locate  the  position  vectors  of  the  nodes  having 
similar  connection  patterns  closely  to  each  other. 


3  Proposed  Method 

3.1  Proposed  Algorithm 

The  new  spherical  embedding  method  is  based  on  a  power  iteration.  It  has  two  opera¬ 
tions  on  the  positioning  vectors  which  we  call  double-centering  operation  and  normal¬ 
izing  operation.  In  order  to  describe  our  algorithm,  we  need  to  introduce  the  centering 
matrices  and  normalizing  operations.  The  centering  (Young-Householder  transforma¬ 
tion)  matrices  are  defined  as  H*  =  I m  ~  Hy  =  I;v  -  where  IM  and 

I,v  stands  for  M  x  M  and  N  x  N  identity  matrices,  respectively,  and  1  m  and  1  ,v  are 
M-  and  /V-dimensional  vectors  whose  elements  are  all  one.  Clearly,  the  mean  vector  of 
the  resulting  position  vectors  becomes  0  by  the  operations  H^X  and  H  VY.  On  the  other 
hand,  the  normalizing  operations  are  defined  as  A M(X)  =  rAdiag(XXrr1/2X,  AW(Y)  = 
rBdiag(YYr)~1/2Y,  where  diag(-)  is  an  operation  to  set  all  the  non-diagonal  elements  to 
zero,  i.e.,  diagtXX7  )  is  a  diagonal  matrix  whose  m- th  element  is  x!mxm. 

Intuitively,  the  basic  procedure  of  our  proposed  algorithm  is  that  the  position  vector 
xm  is  repeatedly  moved  to  the  position  calculated  by  adding  the  position  vectors  { v„ ) 
that  are  connected  to  x,„.  Of  course,  we  need  to  perform  a  normalizing  operation  so  as 
to  satisfy  the  spherical  constraints.  Below  we  describe  our  proposed  algorithm. 

1.  Initialize  the  matrix  X  and  Y. 

2.  Update  the  matrix  X  <—  AM(H  v,/AH:vY). 

3.  Update  the  matrix  Y  <—  A,y(H,yA7  H  wX). 

4.  Terminate  if  the  changes  for  the  position  vectors  X  and  Y  are  small. 

5.  Return  to  the  step  2. 
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As  the  basic  framework,  our  proposed  algorithm  employs  a  power  iteration,  just  like  the 
HITS  algorithm  [3],  which  utilizes  A  and  A7  ,  does.  However,  the  main  differences  are 
use  of  the  double-centering  operations  by  HM  and  H  v  and  the  normalizing  operations 
by  A m(-)  and  A,y(-).  Here  note  that  the  double-centering  operation  is  also  employed  in 
the  standard  multidimensional  scaling  method  [5], 

Now  we  briefly  mention  the  computational  complexity  of  our  algorithm.  Clearly, 
the  main  computational  complexity  of  one-iteration  comes  from  the  multiplication  by 
the  matrix  A  (or  Ar)  which  is  the  most  intensive  part  and  is  proportional  to  the  num¬ 
ber  of  links  in  the  bipartite  graph.  Thus,  the  proposed  algorithm  is  expected  to  work 
much  faster  especially  for  a  sparse  bipartite  graph,  compared  with  the  existing  spheri¬ 
cal  embedding  algorithm  that  require  a  nonlinear  optimization  just  like  a  spring  force 
embedding  [2]  does.  In  fact,  it  has  been  well  known  that  the  PageRank  algorithm  based 
on  a  power  iteration  works  very  fast  for  a  large  and  sparse  network  [4], 

3.2  Convergence  Proof 

We  prove  the  convergence  property  of  the  algorithm.  To  do  this,  we  first  introduce 
the  following  double -centered  matrix  B  =  that  is  calculated  from  the  adjacency 

matrix  A  as  B  =  IIv/AHy.  Then,  by  using  the  matrix  B,  we  can  consider  the  following 
objective  function  with  respect  to  the  position  vectors  X  =  (xi,---  ,x«)7  and  Y  = 
(yt»  -  •  -  ,yN)T- 


M  N  T  1  M  1  N 

^  ^  bm,n  —  —  +  ^  X  Am(r*  ~  X'"X,h)  +  T  X  -  y»  y «).  (!) 

m= 1  n=  1  m_  |  «=1 


where  {Am  \  m  =  1 ,  •  •  •  ,  M }  and  j/r„  |  n  =  1 ,  •  •  •  ,  N)  correspond  to  Lagrange  multipliers 
for  the  spherical  constraints,  i.e.,  x7'x,„  =  r~  and  y7  y„  =  r7  for  1  <  m  <  M  and 
1  <  n  <  N. 

Now  we  consider  maximizing  /(X,  Y)  defined  in  Equation  (1)  by  use  of  a  coordinate 
strategy.  Note  that  maximizing  /(X,  Y)  pushes  the  pairs  x„,  and  ym  to  the  same  direction 
if  they  are  connected  and  pushes  them  to  the  opposite  direction  if  they  are  unconnected, 
and  realizes  the  intended  visualization.  We  repeat  the  following  two  steps:  maximizing 
/(X,  Y)  with  respect  to  X  by  fixing  the  matrix  Y  first,  and  maximizing  /(X,  Y)  with 
respect  to  Y  by  fixing  the  matrix  X  next.  If  the  maximization  of  these  steps  are  achieved 
by  the  above  algorithm’s  step  2  and  3,  respectively,  we  can  guarantee  the  convergence 
of  our  proposed  algorithm. 

In  order  to  confirm  these  facts,  we  consider  the  following  gradient  vector  of  the 
objective  function  /(X,  Y)  with  respect  to  xm. 


dJ(X,  Y) 
<9xm 


1 

DiDi 


y  n  A, 


(2) 


Thus,  for  a  fixed  matrix  Y,  we  obtain  the  optimal  position  vector  xm  which  maxi¬ 
mizes  the  objective  function  /(X,  Y)  as  xm  =  pr]jXm,  xm  =  X^Li  b ,„,„}/«■  Here  note 
that  the  optimal  vector  xm  is  calculated  by  using  the  matrix  Y  only.  Thus,  for  m  = 
1,  •  •  •  ,  M,  by  using  the  normalizing  operation  A,w(-)  whose  diagonal  elements  become 
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1  Science  Fiction/Fantasy 

•  red  circle 

9  Documentary 

►  olive  triangle-right 

2  Action/Adventure 

■  black  square 

10  Drama 

x  lime  cross 

3  Animation 

♦  green  diamond 

11  Family 

+  darkgold  plus 

4  Comedy 

★  blue  star 

12  Horror 

*  darkcyan  asterisk 

5  Suspense 

♦  maroon  hexagon 

13  Musical 

•  magenta  circle 

6  Teen 

A  orange  triangle-up 

1 4  Romance 

cyan  square 

7  Western 

▼  purple  triangle-down 

1 5  Special  Effects 

4  yellow  diamond 

8  War 

4  navy  triangle-left 

1 6  Others 

★  gray  star 

Fig.  2:  category  names  in  Japanese  YahoolMovies  site 


Gt/Pill,  •  •  •  *  Gi/PmII,  we  can  obtain  the  following  solution  in  the  vector-matrix  repre¬ 
sentation. 


X  =  Am(BY)  =  Am(HmAHwY).  (3) 

Recall  that  Equation  (3)  performs  centering  the  matrix  Y  by  the  matrix  H#,  multiplies 
the  adjacency  matrix  A,  performs  re-centering  the  matrix  by  multiplying  the  matrix 
H m,  and  normalizes  so  as  to  guarantee  spherical  constraints  (with  radius  rA).  By  this 
formula,  we  can  obtain  the  optimal  solution  of  position  vectors  X  by  fixing  the  matrix 

Y. 

Similarly,  we  can  also  obtain  the  following  optimal  solution  of  position  vector  y„  by 
fixing  the  matrix  X  as  y„  =  j^y,,,  y„  =  2m=t  V«xm-  Thus,  for  n  =  1,  •  •  •  ,  N,  by  using 
the  normalizing  operation  A^(-)  whose  diagonal  elements  become  rg/||yi||,  ■  •  •  ,  r/(/||y,y||, 
we  can  obtain  the  following  solution  in  the  vector-matrix  representation. 

Y  =  Aw(BrX)  =  Aw(HwArHMX).  (4) 

Therefore,  since  the  finite  objective  function  J(X,  Y)  defined  in  Equation  (1)  has  the 
analytical  optimal  solution  under  the  condition  that  either  X  or  Y  is  fixed,  and  is  always 
maximized  by  performing  the  step  2  and  3  of  the  algorithm,  we  can  guarantee  that  the 
algorithm  always  converges. 


4  Evaluation  by  Experiments 

4.1  Network  Data 

We  regard  the  movies  as  nodes  in  Vb ,  and  their  genres  as  nodes  in  VA  for  the  Japanese 
YahoolMovies  site  2.  Note  that  each  movie  is  associated  with  more  than  or  equal  to 
one  genre.  In  Fig.  2,  we  show  their  genre  names  used  in  our  experiments,  and  for  our 
visual  analyses  purpose,  we  assign  an  individual  marker  with  a  different  color  to  each 
genre  as  shown  in  this  figure.  In  order  to  evaluate  our  proposed  method  by  using  a 
set  of  different  bipartite  graphs,  we  classify  these  movies  into  7  groups  according  to 
their  release  dates(  1950-59,  1960-69,  1970-79,  1980-89,  1990-99,  2000-04  and  2005- 
09). Here  the  number  of  genres  is  \VA\  =  16  for  all  the  periods,  the  number  of  movies 
| VbI  are  594,  1079,  1314,  1805,  2659,  2948  and  3264,  and  the  number  of  links  |£j  are 
899,  1617,  2071,  2994,  4424,  6057  and  6564  for  each  period. 

2  http://movies.yahoo.co.jp/ 
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We  regard  the  users  who  answered  questions  as  nodes  in  Vb.  and  the  genres  of  these 
questions  as  nodes  in  V a  for  the  Japanese  Yahoo!Answers  site 3.  Note  that  although  each 
question  belongs  to  only  one  genre,  the  same  user  frequently  answers  several  questions 
belonging  to  a  wide  variety  of  genres.  Thus  we  can  obtain  bipartite  graphs  between  the 
pairs  of  the  users  and  the  genres  they  answered.  In  our  experiments,  we  utilized  a  set 
of  data  from  April,  2004,  to  October,  2005.  Again,  in  order  to  evaluate  our  proposed 
method  by  using  a  set  of  different  bipartite  graphs,  we  classify  these  questions  into  6 
groups  according  to  their  submission  dates().Here  the  number  of  genres  is  \Va\  —  10 
for  all  the  periods,  the  number  of  users  |VB|  are  11871,  27446,  35907,  39451,  42884 
and  46834,  and  the  number  of  links  |£j  are  30849,  80664,  96926,  95714,  102086  and 
1 12548  for  each  period. 

4.2  Brief  Description  of  Other  Visualization  Methods  used  for  Comparison 

We  first  explain  the  existing  spherical  embedding  method  as  our  primal  comparison 
method,  whose  problem  framework  is  the  same  to  ours.  In  this  method  the  follow¬ 
ing  objective  function  is  directly  minimized  with  respect  to  the  position  vectors  X  = 
(xi,  •  •  •  ,xM)T  and  Y  =  (yi,  •  •  •  ,yN)T  under  the  constraints  that  x'mxm  =  r\  and  y£y„  = 
r\  for  1  <  m  <  M  and  1  <  n  <  N.  The  objective  function  is  defined  as  2T(X,  Y)  = 

\  Em=i  El,  (cm*rArB  ~  xj£y„)2,  where  cm,„  =  2 am,n  -  1,  i.e.,  cmjl  =  1  if  ( m,n )  e  £; 
c,n,n  =  -1  otherwise.  In  order  to  obtain  the  solution  vectors,  this  method  repeatedly 
moves  each  position  vector  by  using  the  Newton  method  in  a  framework  of  nonlin¬ 
ear  optimization,  i.e.,  it  repeats  the  following  two  steps:  First,  minimizing  J/tX,  Y)  for 
xm  by  fixing  {xi,  •  •  -xM]  \  xm  and  {y i ,  •  •  ■  y^},  and  next  minimizing  2T(X,  Y)  for  y„  by 
fixing  {xi,  •  •  -  Xm)  and  {yi,  •  •  ■  y^}  \  yn.  Thus  this  method  requires  an  extremely  large 
computation  time  to  obtain  the  final  results. 

We  have  further  compared  the  proposed  method  with  the  four  well  known  embed¬ 
ding  methods:  multi-dimensional  scaling  [5],  spectral  embedding  [1],  spring  force  em¬ 
bedding  [2],  and  cross-entropy  embedding  [7],  Here  the  former  two  perform  a  power 
iteration  with  respect  to  either  a  double-centered  distance  matrix  or  a  graph  Laplacian 
matrix  which  is  calculated  from  a  given  graph,  just  like  our  proposed  spherical  embed¬ 
ding  method  does,  while  the  latter  two  repeatedly  move  each  position  vector  by  using 
the  Newton  method  in  a  framework  of  nonlinear  optimization,  just  like  the  existing 
spherical  embedding  method  does.  Note  that  these  four  methods  are  not  designed  for 
embedding  bipartite  graphs,  but  as  mentioned  earlier,  we  can  straightforwardly  apply 
them  for  our  purpose  because  a  bipartite  graph  is  regarded  as  an  instance  of  general 
undirected  graph. 

In  what  follows  in  this  subsection,  we  regard  a  bipartite  graph  as  an  undirected 
graph  G  =  (V,  E)  to  describe  the  basic  ideas  of  these  standard  embedding  methods, 
and  then  consider  a  framework  of  embedding  it  into  a  IT-dimensional  Euclidean  space. 
In  this  framework,  we  identify  the  set  of  the  nodes  by  a  positive  integer,  i.e.,  V  = 
{1,  •  •  •  ,/,•••  ,  L),  |V|  =  L  and  L  —  M  +  N .  Then,  we  can  define  the  Lx  L  adjacency 
matrix  A  =  {a„h„}  by  setting  a„h„  =  1  if  (m,  n)  e  E;  am>„  =  0  otherwise.  We  denote 
the  IT-dimensional  embedding  position  vectors  by  x„,  for  the  node  m  e  V,  and  then 

3  http://chiebukuro.yahoo.co.jp/ 
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construct  an  Lx  K  matrix  consisting  of  these  position  vectors,  i.e.,  X  =  (xi,  •  •  -Xg)T . 
We  also  denote  the  graph  distance  matrix  by  G  =  {g  each  element  of  which  is  the 
minimum  path  length  between  node  m  and  node  n. 

Multi-dimensional  scaling  method  [5]  first  calculates  the  distance  matrix  G,  and 
performs  the  double  centering  operation  (H/  =  I L  —  j-  lgl[) 

to  the  distance  matrix.  Mathematically  it  is  formulated  as  minimizing  A1(X)  = 

5  Ef=  i  GHl)z k,  where  zk  =  (xu,  ■■■  ,xL,k)T,  and  {zi,  —  ,zK }  need  to  be  or¬ 

thonormal  vectors,  i.e.,  zjzk  =  1  and  zfz*<  =  0  if  £  4-  k' .  Spectral  embedding  method  [1] 
tries  to  directly  minimize  distances  between  position  vectors  of  connecting  nodes.  Math¬ 
ematically  it  is  formulated  as  minimizing  <S(X)  =  Y,k=  l  (D  _  A)z k.  where  D  is  a  di¬ 
agonal  matrix  each  element  of  which  is  the  degree  of  node  (number  of  links).  Note  that 
(D-A)  is  referred  to  as  a  graph  Laplacian  matrix.  Again,  we  set  zk  =  (x\tk,  •  •  •  ,  xkj-)T, 
and  {z1( •  •  •  ,zK}  need  to  be  orthonormal  vectors,  which  excludes  the  trivial  vector  ex¬ 
pressed  as  z  oc  1  /  .  Spring  force  embedding  method  [2]  assumes  that  there  is  a  hypothet¬ 
ical  spring  between  each  connected  node  pair  and  locates  nodes  such  that  the  distance 
of  each  node  pair  is  closest  to  its  minimum  path  length  at  equilibrium.  Mathemati¬ 
cally  it  is  formulated  as  minimizing  7C(X)  =  2m=i  En=m+i  Q'm,n(gm,n  -  l|xm  -  xn||)2, 
where  am  n  is  a  spring  constant  which  is  normally  set  to  l/(2g2v).  Cross-entropy  em¬ 
bedding  method  [7]  first  defines  a  similarity  p(xm,x„)  between  the  embedding  posi¬ 
tions  xm  and  xn  and  uses  the  corresponding  element  amn  of  the  adjacency  matrix  as 
a  measure  of  distance  between  the  node  pair,  and  tries  to  minimize  the  total  cross 
entropy  between  these  two.  Mathematically  it  is  formulated  as  minimizing  C(X)  = 

-  Zm=i  E^m+1  (am,nlogp(xm,xn)  +  (1  -  am,n)  log(l  -  p(xu,xv))).  Here,  note  that  we 
used  the  function  p(xu,xv )  =  exp(-i||x„  -  xv||2)  in  our  experiments. 

4.3  Experimental  Results 

We  first  evaluated  the  efficiency  of  our  proposed  method  in  comparison  with  the  existing 
methods.  We  show  our  experimental  results  in  Fig.  3,  where  Spec,  MDS,  SF,  CE,  eSE 
and  pSE  stand  for  the  spectral  embedding,  multi-dimensional  scaling,  spring  force  em¬ 
bedding,  cross-entropy  embedding,  existing  spherical  embedding  and  proposed  spher¬ 
ical  embedding  methods,  respectively  (machine  used  is  Intel(R)  Xeon(R)  CPU  X5472 
@3.0GHz  with  64GB  memory).  Here  Figs.  3a  and  3b  correspond  to  the  results  by  using 
the  bipartite  graphs  constructed  from  the  Yahoo !Movies  and  Yahoo! Answers  sites,  re¬ 
spectively.  In  these  figures,  we  plotted  the  average  processing  time  (sec.)  over  10  trials 
by  changing  the  initial  position  vectors,  where  the  horizontal  and  vertical  axes  stand  for 
the  number  of  nodes  in  Vg  and  the  processing  times,  respectively.  Here  recall  that  the 
number  of  nodes  in  Vg  is  different  for  each  bipartite  graph  as  mentioned  above. 

As  expected,  these  figures  show  that  our  proposed  spherical  embedding  (pSE)  method 
works  much  faster  than  all  the  existing  methods  we  compared.  More  specifically,  the 
spectral  embedding  (Spec)  method  works  comparable  to  our  method.  This  is  because 
these  methods  perform  a  power  iteration  on  a  sparse  adjacency  matrix.  In  fact,  the  multi¬ 
dimensional  scaling  (MDS)  method  requires  a  substantially  large  computation  time  be¬ 
cause  it  needs  to  perform  a  power  iteration  on  a  full  distance  matrix.  All  the  other 
methods  including  the  existing  spherical  embedding  (eSE)  method,  which  repeatedly 
move  each  position  vector  by  using  the  Newton  method,  generally  require  an  extremely 
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(a)  Yahoo!Movies  (b)  YahoolAnswers 

Fig.  3:  Comparison  of  processing  times. 

large  computation  time  before  the  final  results  are  obtained.  Especially,  both  the  spring 
force  embedding  (SF)  and  cross-entropy  embedding  (SE)  methods  require  more  than 
three  days  to  obtain  the  final  results  even  for  one  trial  when  the  numbers  of  nodes  for 
the  YahoolAnswers  graphs  become  more  than  40,000;  thus  we  omitted  these  results  in 
Fig.  3b.  Here  we  should  emphasize  that  the  scale  of  the  vertical  axis  of  these  figures  is 
logarithmic. 

Next  we  evaluated  the  visualization  results  of  our  proposed  method  in  comparison 
with  the  existing  methods.  Due  to  a  space  limitation,  we  only  show  our  experimental 
results  obtained  for  a  bipartite  graph  constructed  from  the  Japanese  Yahoo!Movies  sites 
in  Fig.  4.  Here  recall  that  the  genre  information  has  been  shown  in  Fig.  2.  In  Fig.  4a, 
we  show  the  visualization  result  by  our  proposed  method,  which  we  consider  intuitively 
natural.  Actually,  we  can  see  that  the  genre  nodes  of  Action/Adventure  (black  square) 
and  Suspense  (maroon  hexagon)  are  located  in  near  positions  at  the  right-side  of  the 
inner  circle  (9a),  while  at  the  opposite  left-side  of  this  circle,  the  genre  nodes  of  Teen 
(orange  triangle_up)  and  Romance  (cyan  square)  are  located  in  near  positions.  Overall, 
we  can  observe  that  the  similar  genres  are  located  closely  on  the  inner  circle  (0a). 

Now  we  compare  the  above  results  with  the  five  existing  methods.  The  first  one  is 
the  visualization  result  by  the  existing  spherical  embedding  method  shown  in  Fig.  4b. 
We  see  that  there  are  several  minor  differences  but  we  consider  this  result  comparable  to 
the  result  by  our  method.  However,  this  one  is  very  slow  and  inefficient.  Our  method  is 
much  faster.  The  second  one  is  the  visualization  result  by  the  multidimensional  scaling 
method  shown  in  Fig.  4c.  We  can  observe  some  clusters  of  genres.  Although  this  result 
might  indicate  some  intrinsic  property,  we  feel  that  the  spherical  embedding  scheme 
is  a  more  natural  representation  of  bipartite  graphs.  The  third  one  is  the  visualization 
result  by  the  spectral  embedding  method  shown  in  Fig.  4d.  This  one  is  relatively  poor  in 
our  own  experiments.  In  fact,  the  two  genres  of  Drama  (lime  cross)  at  the  bottom-right 
and  Documentary  (Olive  triangle_right)  at  the  top-left  are  too  much  isolated,  although 
this  method  works  reasonably  fast  among  the  existing  methods. The  fourth  and  the  fifth 
ones  are  the  visualization  results  by  the  spring  force  embedding  method  and  the  cross¬ 
entropy  embedding  method  shown  in  Figs.  4e  and  4f.  We  can  observe  a  similar  tendency 
between  these  two,  e.g .,  we  can  easily  see  that  the  genre  node  of  Drama  (lime  cross)  is 
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much  isolated  in  both.  The  main  difference  in  these  methods  is  that  we  can  observe  that 
some  genre  nodes  are  clustered  for  the  spring  force  embedding  method,  but  there  are  no 
such  clusters  and  all  the  genres  are  scattered  for  the  cross-entropy  embedding  method. 
Overall,  although  each  embedding  method  might  have  its  own  characteristics  that  are 
both  advantageous  and  disadvantageous,  we  believe  that  our  proposed  spherical  embed¬ 
ding  method  is  most  effective  for  visualizing  bipartite  graphs  in  terms  of  efficiency  and 
interpretability. 

Last  but  not  least,  we  evaluated  our  proposed  method  only  in  the  case  of  two- 
dimensional  embedding  for  our  visualization  purpose,  but  this  does  not  mean  that  it 
is  limited  to  two-dimensional  embedding.  It  is  quite  easy  to  extend  it  to  the  general 
/('-dimension  embedding.  We  plan  to  evaluate  our  method  as  a  powerful  technique  for 
both  dimensional  reduction  and  clustering  as  a  future  work. 

5  Conclusion 

In  this  paper,  we  addressed  the  problem  of  visualizing  structure  of  bipartite  graphs  such 
as  relations  between  pairs  of  objects  and  their  multi-labeled  categories,  and  proposed  a 
new  spherical  embedding  method  that  is  based  on  a  power  iteration.  The  key  features 
of  this  method  is  that  it  employs  two  operations  on  the  positioning  vectors,  one  called 
double-centering  operation  and  the  other  called  normalizing  operation.  This  enables 
the  iterative  approach  to  be  equivalent  to  maximizing  an  objective  function  which  is 
guaranteed  to  converge.  Thus,  our  algorithm  is  theoretically  guaranteed  to  converge. 
We  applied  our  method  to  a  set  of  bipartite  graphs  with  different  sizes  and  connections 
which  were  constructed  from  the  Japanese  sites  of  YahooIMovies  and  Yahoo ’Answers, 
and  compared  the  results  with  five  existing  visualization  methods.  The  results  showed 
that  the  proposed  method  works  much  faster  than  all  the  five  existing  methods,  and  the 
visualization  results  are  intuitively  understandable  and  comparable  to  the  best  available 
so  far  known.  In  future,  we  plan  to  apply  the  new  method  to  evaluate  its  performance 
for  a  wide  variety  of  bipartite  graphs. 
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(a)  proposed  spherical  embedding 


(b)  existing  spherical  embedding 


(c)  multi-dimensional  scaling 


Fig.  4:  Visualization  Results  (Yahoo!Movies  1950  -  1959) 


